Install bigWig and latticeExtra package
install.packages("devtools", quiet = TRUE)
## also installing the dependencies 'Rcpp', 'utf8', 'askpass', 'credentials', 'openssl', 'sys', 'zip', 'gitcreds', 'httr2', 'ini', 'httpuv', 'xtable', 'sourcetools', 'later', 'promises', 'fansi', 'systemfonts', 'textshaping', 'pillar', 'pkgconfig', 'diffobj', 'rematch2', 'clipr', 'crayon', 'curl', 'gert', 'gh', 'purrr', 'rprojroot', 'rstudioapi', 'whisker', 'shiny', 'callr', 'processx', 'downlit', 'httr', 'ragg', 'tibble', 'xml2', 'htmlwidgets', 'prettyunits', 'xopen', 'brew', 'commonmark', 'cpp11', 'brio', 'praise', 'ps', 'waldo', 'usethis', 'desc', 'miniUI', 'pkgbuild', 'pkgdown', 'pkgload', 'profvis', 'rcmdcheck', 'remotes', 'roxygen2', 'rversions', 'sessioninfo', 'testthat', 'urlchecker', 'withr'
library(devtools)
## Loading required package: usethis
devtools::install_github('andrelmartins/bigWig',
subdir='bigWig')
## Downloading GitHub repo andrelmartins/bigWig@HEAD
## ── R CMD build ─────────────────────────────────────────────────────────────────
## * checking for file ‘/private/tmp/RtmpUWt6hK/remotese6857926fb4/andrelmartins-bigWig-beac5a7/bigWig/DESCRIPTION’ ... OK
## * preparing ‘bigWig’:
## * checking DESCRIPTION meta-information ... OK
## * cleaning src
## * checking for LF line-endings in source and make files and shell scripts
## * checking for empty or unneeded directories
## * building ‘bigWig_0.2-9.tar.gz’
library(bigWig)
install.packages("latticeExtra", quiet = TRUE)
## also installing the dependencies 'deldir', 'RcppEigen', 'png', 'jpeg', 'RColorBrewer', 'interp'
install.packages("DESeq2", quiet = TRUE)
## Warning: package 'DESeq2' is not available for this version of R
##
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Install bedtools
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
brew install bedtools
In previous analysis, we have showed that using a subset of GATA3-peak with known enriched GATA3 motif (GAT—-ATC), we can successfully generate a composite profile with expected GAT to ATC 3mer distribution. –This is answering if we could build a pipeline to successfully demonstrate a known 3mer-3mer structure (two 3mer sites with fixed number of spacing in between).
With this analysis pipeline in hand, the real question we want to ask is:
1) Is our GATA3 ChIP-seq library has predominant enrichment of such 3mer-3mer (GAT-GAT or GAT-ATC) structure compared to any other library that are not ChIPed by GATA3?
2) Is this GAT-GAT or GAT-ATC structure more common to be observed than any other 3mer combinations?
Background: we have GATA3 peak that have exhaustively searched for GATA3-like motif with MEME/STREME software. However, there are ~40% peak failed in finding a GATA3-like motif. Even for peak subset that have relatively high peak intensity (top5%), we still have ~20% peaks that MEME/STREME cannot find a GATA3-like motif.
Generally, we would expect to find binding sequences within the peak region, so that the ChIPed transcription factor (in our case, GATA3) can bind to. This is even more true for peaks with relatively high intensity.
Our question now became, is MEME/STREME limited in finding binding sequences patterned like GATA3 (that has fixed spacings between two 3mer)? Can we find the 3mer-3mer sequences in GATA3 peaks that do not contain the MEME/STREME found GATA3-like motifs?
peaks without 9 GATA3-motif (that found by MEME/STREME)
wc -l without_motifs_123456_789.bed
head -5 without_motifs_123456_789.bed
## 38980 without_motifs_123456_789.bed
## chr1 827280 827481 GATA_ChIP_peak_28 9.30478
## chr1 916669 916870 GATA_ChIP_peak_31 7.79887
## chr1 924753 924954 GATA_ChIP_peak_33 3.78065
## chr1 966553 966754 GATA_ChIP_peak_34 3.78065
## chr1 999408 999609 GATA_ChIP_peak_36 2.11515
Peak Intensity
In the above file, the last column is the peak intensity reported by MACS3.
Notice that a peak with high MACS3-intensity is not necessarily a intense peak. There are two things we need to consider regarding peak intensities. The first is peak region; the second is dynamic range. Imaging a peak that is very intense but within a narrow range, the other peak is not so intense but can span its signals across a very long distance. We need to consider this region differences while calling any peak to be intense or not.
First use Sam Flag 0x3(reads paired and mapped in proper pair) to calculate the read depth for each GATA library.
# calculate the size factors
module load samtools/1.12
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/sorted.bam_final/
for i in GATA
do
echo $i
> ${i}_header.txt
> ${i}_reads.txt
for j in ${dir}MCF7_dTAGGATA522*_${i}_*.sorted.bam
do
echo $j
name=$(echo $j | awk -F $dir '{print $2}' | awk -F".sorted.bam" '{print $1}')
echo $name | paste ${i}_header.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_header.txt
reads=`samtools view -c -f 0x3 $j` #count the reads paired and mapped in proper pair
echo $reads | paste ${i}_reads.txt - > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
done
cat ${i}_header.txt ${i}_reads.txt > ${i}_tmp.txt
mv ${i}_tmp.txt ${i}_reads.txt
rm ${i}_header.txt
done
cat GATA_reads.txt
## MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2 MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1 MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3 MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2 MCF7_dTAGGATA522_GATA_dE_rep3
## 33948616 32585396 34475586 51112588 147834968 136838760 34136142 136271358 85665512
bargraph
library(lattice)
df=as.data.frame(t(read.table("GATA_reads.txt", header=F)))
colnames(df)=c("library", "aligned_reads")
df$libraey=as.factor(df$library)
df$aligned_reads=as.numeric((df$aligned_reads))
barchart(aligned_reads ~ library,
data = df,
ylim=c(0, max(df$aligned_reads)*1.04),
col = "skyblue",
scales = list(x = list(rot = 45)),
xlab = "GATA3 ChIP library",
ylab = "concordantly aligned reads"
)
This bar graph represents the read depth in each GATA3 ChIP-seq library. In the downstream analysis, we will need to use DESeq2 to normalize the counts in each library with size factor to account for this read depth difference.
Get peak intensity within 400bp window withDeseq2 for peaks without MEME/STREME found motifs 12345678.
load peaks without motif 12345678.
a = read.table('without_motifs_123456_789.bed', sep = "\t", header=FALSE)
nrow(a)
## [1] 38980
head(a)
## V1 V2 V3 V4 V5
## 1 chr1 827280 827481 GATA_ChIP_peak_28 9.30478
## 2 chr1 916669 916870 GATA_ChIP_peak_31 7.79887
## 3 chr1 924753 924954 GATA_ChIP_peak_33 3.78065
## 4 chr1 966553 966754 GATA_ChIP_peak_34 3.78065
## 5 chr1 999408 999609 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000436 1000637 GATA_ChIP_peak_37 5.52004
Increase the width to 400bp window.
library(bigWig)
peak.region.400win=center.bed(a, upstreamWindow = 200, downstreamWindow = 200)
nrow(peak.region.400win)
## [1] 38980
head(peak.region.400win)
## V1 V2 V3 V4 V5
## 1 chr1 827180 827581 GATA_ChIP_peak_28 9.30478
## 2 chr1 916569 916970 GATA_ChIP_peak_31 7.79887
## 3 chr1 924653 925054 GATA_ChIP_peak_33 3.78065
## 4 chr1 966453 966854 GATA_ChIP_peak_34 3.78065
## 5 chr1 999308 999709 GATA_ChIP_peak_36 2.11515
## 6 chr1 1000336 1000737 GATA_ChIP_peak_37 5.52004
In the below chunk, we define a function to get raw counts from each sample/reps. This function uses bed.region.bpQuery.bigWig from bigWig package. It requires a bed region file that has the peak region that we want to analysis (right now we want to analyse peak without motifs 123456789); it also requires non-normalized, raw bigWig files (generated directly from seqOutbias) to allow bed.region.bpQuery.bigWig to query the counts information from. The output from this function will be: rows will be the same as bed file, columns will be each bigWig library, entries will be the raw counts.
#functions on github
source('https://raw.githubusercontent.com/mjg54/znf143_pro_seq_analysis/master/docs/ZNF143_functions.R')
#function
get.counts.interval <- function(df, path.to.bigWig, file.prefix = 'H') {
vec.names = c()
inten.df=data.frame(matrix(ncol = 0, nrow = nrow(df)))
for (mod.bigWig in Sys.glob(file.path(path.to.bigWig, paste(file.prefix, "*.bigWig", sep ='')))) {
factor.name = strsplit(strsplit(mod.bigWig, "/")[[1]][length(strsplit(mod.bigWig, "/")[[1]])], '.bigWig')[[1]][1]
print(factor.name)
vec.names = c(vec.names, factor.name)
loaded.bw = load.bigWig(mod.bigWig)
print(mod.bigWig)
mod.inten = bed.region.bpQuery.bigWig(loaded.bw, df, abs.value = TRUE)
inten.df = cbind(inten.df, mod.inten)
}
colnames(inten.df) = vec.names
r.names = paste(df[,1], ':', df[,2], '-', df[,3], sep='')
row.names(inten.df) = r.names
return(inten.df)
}
We will first use the defined function to query raw counts from each non-normalized bigWig files of GATA ChIP-seq use peak region info loaded before (data frame “peak.region.400win”).
library(bigWig)
#non-normalized counts
GATA.counts.df= get.counts.interval(peak.region.400win, "/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/bigWigs/Seqoutbias_bw","MCF") #25 libraries
#nrow(peak.region.400win)
#[1] 38980
#nrow(GATA.counts.df)
#[1] 38980
#head(GATA.counts.df)
#colnames(GATA.counts.df)
GATA.analysis.regions=GATA.counts.df[,grepl("_GATA_",colnames(GATA.counts.df))] # get non-normalized counts from "GATA" libraries
#colnames(GATA.analysis.regions)
#[1] "MCF7_dTAGGATA522_GATA_CC_rep1" "MCF7_dTAGGATA522_GATA_CC_rep2"
#[3] "MCF7_dTAGGATA522_GATA_CC_rep3" "MCF7_dTAGGATA522_GATA_CE_rep1"
#[5] "MCF7_dTAGGATA522_GATA_CE_rep2" "MCF7_dTAGGATA522_GATA_CE_rep3"
#[7] "MCF7_dTAGGATA522_GATA_dE_rep1" "MCF7_dTAGGATA522_GATA_dE_rep2"
#[9] "MCF7_dTAGGATA522_GATA_dE_rep3"
identical(rownames(GATA.analysis.regions),rownames(GATA.counts.df)) # [1] TRUE
Then we use the DESeq2 package to make a counts matrix (DESeqDataSetFromMatrix) and calculate size factors for each library (estimateSizeFactorsForMatrix) use the previously calculated read depth for each library (“GATA_reads.txt”); We use this size factor to normalize the counts (sizeFactors).
Then we use rowMeans to average the normalized counts for the three GATA_CC reps, and save it as “peak.intensities”; this normalized counts now can be use to determine if a peak is intense or not.
library(DESeq2)
sample.conditions = factor(sapply(strsplit(colnames(GATA.analysis.regions), '_rep'), '[', 1))
#[1] MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC MCF7_dTAGGATA522_GATA_CC
#[4] MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE MCF7_dTAGGATA522_GATA_CE
#[7] MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE MCF7_dTAGGATA522_GATA_dE
#3 Levels: MCF7_dTAGGATA522_GATA_CC ... MCF7_dTAGGATA522_GATA_dE
deseq.counts.table = DESeqDataSetFromMatrix(countData = GATA.analysis.regions, # DESeqDataSet needs countData to be non-negative integers; non-normalized counts are integer, normalized signals has decimals.
colData = as.data.frame(sample.conditions),
design = ~ sample.conditions)
GATA.SF <- read.table("GATA_reads.txt", sep = '\t', header = TRUE)[,-1] # GATA size factors from read depth
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 33948616 32585396
# MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
# 34475586 51112588
# MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
# 147834968 136838760
# MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
# 34136142 136271358
# MCF7_dTAGGATA522_GATA_dE_rep3
# 85665512
GATA.size.factors = estimateSizeFactorsForMatrix(GATA.SF) # read depth transformed to size factor
#MCF7_dTAGGATA522_GATA_CC_rep1 MCF7_dTAGGATA522_GATA_CC_rep2
# 0.5385594 0.5169334
#MCF7_dTAGGATA522_GATA_CC_rep3 MCF7_dTAGGATA522_GATA_CE_rep1
# 0.5469193 0.8108480
#MCF7_dTAGGATA522_GATA_CE_rep2 MCF7_dTAGGATA522_GATA_CE_rep3
# 2.3452478 2.1708044
#MCF7_dTAGGATA522_GATA_dE_rep1 MCF7_dTAGGATA522_GATA_dE_rep2
# 0.5415343 2.1618032
#MCF7_dTAGGATA522_GATA_dE_rep3
# 1.3589941
sizeFactors(deseq.counts.table) <- GATA.size.factors # assign to each column of the count matrix (deseq.counts.table) the size factor to bring each column to a common scale
dds <- DESeq(deseq.counts.table)
normalized.counts.GATA3 = counts(dds, normalized=TRUE)
head(normalized.counts.GATA3)
peak.intensities = rowMeans(normalized.counts.GATA3[,1:3]) # we want to get the average read counts for CC groups
names(peak.intensities) = rownames(normalized.counts.GATA3)
save.image('231205_GATA3_ChIP_deseq.Rdata')
Subset peaks without the 8 GATA3-motifs, in top20% intensity quantile
We can load the saved Rdata and look at each dataframe.
load('231205_GATA3_ChIP_deseq.Rdata')
## Warning: namespace 'DESeq2' is not available and has been replaced
## by .GlobalEnv when processing object 'dds'
Read this .bed file into R, and use DeSeq2 to count read size and parse into different quantile
head(peak.intensities)
## chr1:827180-827581 chr1:916569-916970 chr1:924653-925054
## 45.32469 36.69993 30.44684
## chr1:966453-966854 chr1:999308-999709 chr1:1000336-1000737
## 26.87016 21.22455 24.34263
quantile(peak.intensities, probs = seq(.25, 1.00, by = .25))
## 25% 50% 75% 100%
## 34.13704 46.04222 71.43499 6270.21535
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
## 20% 40% 60% 80% 100%
## 31.94822 40.59628 52.89014 83.30172 6270.21535
violin plot
This violin plot is showing the distribution of peak intensity (log10).
library(lattice)
log10quantiles <- quantile(log(abs(peak.intensities), base = 10), probs = seq(.20, 1.00, by = .20))
png('violinplot_GATA3_ChIP_peak_normalized_intensity.png')
print(bwplot(log(abs(peak.intensities), base = 10) ~ factor("1"),
main = "Violin-Like Plot",
panel = function(x, ...) {
panel.violin(x, ...)
panel.abline(h = log10quantiles, col = "red", lty = 2)
},
xlab = "", ylab = "log10 normalized Intensity")
)
Figure 1: normalized GATA3 ChIP peak intensity
This violin plot represents the distribution and probability density of (log10) normalized peak intensity (for peak without motif 1~9). Each red dotted line indicates the quantile cutoff of 20%, which can help us to visualize the peak intensity distribution across different quantile.
Parse peaks into 5 intensity quantile by 20%.
chr = sapply(strsplit(names(peak.intensities), ":"), "[", 1)
rnge = sapply(strsplit(names(peak.intensities), ":"), "[", 2)
start = as.numeric(sapply(strsplit(rnge, "-"), "[", 1)) + 200
end = as.numeric(sapply(strsplit(rnge, "-"), "[", 2)) - 200
quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))
## 20% 40% 60% 80% 100%
## 31.94822 40.59628 52.89014 83.30172 6270.21535
# 1bp summit quantile file
j =0
q=seq(.20, 1.00, by = .20)
count=0
for (i in quantile(peak.intensities, probs = seq(.20, 1.00, by = .20))){
count = count +1
write.table(file = paste0('quantile', as.character(q[count]), '_summits.bed'), data.frame(chr[peak.intensities > j & peak.intensities <= i], start[peak.intensities > j & peak.intensities <= i], end[peak.intensities > j & peak.intensities <= i], peak.intensities[peak.intensities > j & peak.intensities <= i]), sep = '\t', quote=FALSE, col.names=FALSE, row.names=FALSE )
j = i
}
for i in *quantile*.bed
do
wc -l $i
done
## 7594 all.distance.quantile0.2.bed
## 7796 all.distance.quantile0.4.bed
## 7796 all.distance.quantile0.6.bed
## 7796 all.distance.quantile0.8.bed
## 7796 all.distance.quantile1.bed
## 7563 closest.2nd.distance.quantile0.2.bed
## 7781 closest.2nd.distance.quantile0.4.bed
## 7785 closest.2nd.distance.quantile0.6.bed
## 7781 closest.2nd.distance.quantile0.8.bed
## 7791 closest.2nd.distance.quantile1.bed
## 7594 closest_1st_GAT_to_GATA_peak_quantile0.2.bed
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.4.bed
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.6.bed
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.8.bed
## 7796 closest_1st_GAT_to_GATA_peak_quantile1.bed
## 7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
## 7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
## 7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
## 7569 closest_1st_GAT_to_localctrl_200_uniq_quantile0.2.sorted.bed
## 7791 closest_1st_GAT_to_localctrl_200_uniq_quantile0.4.sorted.bed
## 7784 closest_1st_GAT_to_localctrl_200_uniq_quantile0.6.sorted.bed
## 7781 closest_1st_GAT_to_localctrl_200_uniq_quantile0.8.sorted.bed
## 7792 closest_1st_GAT_to_localctrl_200_uniq_quantile1.sorted.bed
## 7594 localctrl.200.distance.quantile0.2.bed
## 7796 localctrl.200.distance.quantile0.4.bed
## 7796 localctrl.200.distance.quantile0.6.bed
## 7796 localctrl.200.distance.quantile0.8.bed
## 7796 localctrl.200.distance.quantile1.bed
## 7594 quantile0.2_summits.200bpctrl.bed
## 7594 quantile0.2_summits.bed
## 7796 quantile0.4_summits.200bpctrl.bed
## 7796 quantile0.4_summits.bed
## 7796 quantile0.6_summits.200bpctrl.bed
## 7796 quantile0.6_summits.bed
## 7796 quantile0.8_summits.200bpctrl.bed
## 7796 quantile0.8_summits.bed
## 7796 quantile1_summits.200bpctrl.bed
## 7796 quantile1_summits.bed
Now working on the top 20% quantile peaks:
wc -l quantile1_summits.bed
head quantile1_summits.bed
In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
-a is the sorted peak summit file (centered 1bp);
-b is the sorted, and concatenated GAT coordinates file (both plus and minus);
Input1 -a: peak: quantile1_summits.bed : peaks without motif 123456789, top20% intensity
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/overrep_3mer/peak_without_123456789/
head ${dir}quantile1_summits.bed
Input2 -b: GAT coordinates on full hg38 use read size==30
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/
head ${dir}hg38.3.3.3minus.14_GAT.bed
head ${dir}hg38.3.3.3plus.36_GAT.bed
head ${dir}hg38.3.3.3minus.36_ATC.bed
head ${dir}hg38.3.3.3plus.14_ATC.bed
These files are very large, here are the headed version:
head head2000_hg38.3.3.3plus.36_GAT.bed
echo ""
head head2000_hg38.3.3.3minus.14_GAT.bed
echo ""
head head2000_hg38.3.3.3plus.14_ATC.bed
echo ""
head head2000_hg38.3.3.3minus.36_ATC.bed
## chr1 11521 11524 36 36 + GAT
## chr1 12188 12191 36 36 + GAT
## chr1 13274 13277 36 36 + GAT
## chr1 13314 13317 36 36 + GAT
## chr1 15171 15174 36 36 + GAT
## chr1 15196 15199 36 36 + GAT
## chr1 15883 15886 36 36 + GAT
## chr1 16274 16277 36 36 + GAT
## chr1 33401 33404 36 36 + GAT
## chr1 39192 39195 36 36 + GAT
##
## chr1 11145 11148 14 14 - GAT
## chr1 11160 11163 14 14 - GAT
## chr1 11576 11579 14 14 - GAT
## chr1 13315 13318 14 14 - GAT
## chr1 19797 19800 14 14 - GAT
## chr1 27023 27026 14 14 - GAT
## chr1 28605 28608 14 14 - GAT
## chr1 30925 30928 14 14 - GAT
## chr1 47300 47303 14 14 - GAT
## chr1 47306 47309 14 14 - GAT
##
## chr1 13315 13318 14 14 + ATC
## chr1 15172 15175 14 14 + ATC
## chr1 19751 19754 14 14 + ATC
## chr1 30902 30905 14 14 + ATC
## chr1 38888 38891 14 14 + ATC
## chr1 39150 39153 14 14 + ATC
## chr1 39215 39218 14 14 + ATC
## chr1 41955 41958 14 14 + ATC
## chr1 47094 47097 14 14 + ATC
## chr1 47641 47644 14 14 + ATC
##
## chr1 11159 11162 36 36 - ATC
## chr1 11462 11465 36 36 - ATC
## chr1 11569 11572 36 36 - ATC
## chr1 13314 13317 36 36 - ATC
## chr1 27022 27025 36 36 - ATC
## chr1 40266 40269 36 36 - ATC
## chr1 41985 41988 36 36 - ATC
## chr1 47682 47685 36 36 - ATC
## chr1 47780 47783 36 36 - ATC
## chr1 49295 49298 36 36 - ATC
we will concatanate the plus and minus 3mer file together.
cat ${dir}hg38.3.3.3minus.14_GAT.bed ${dir}hg38.3.3.3plus.36_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.bed
cat ${dir}hg38.3.3.3minus.36_ATC.bed ${dir}hg38.3.3.3plus.14_ATC.bed > hg38.3.3.3.30_plus_minus_ATC.bed
wc -l ${dir}hg38.3.3.3minus.14_GAT.bed #32096009
wc -l ${dir}hg38.3.3.3plus.36_GAT.bed #32147038
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
echo ""
wc -l ${dir}hg38.3.3.3minus.36_ATC.bed #32081985
wc -l ${dir}hg38.3.3.3plus.14_ATC.bed #32035017
wc -l hg38.3.3.3.30_plus_minus_ATC.bed #64117002
load files
load files contains selected ChIP peak summit info
chip.peak.summit=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit)
## [1] 7796
head(chip.peak.summit)
## V1 V2 V3 V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476 98.51927
## 6 chr1 8120791 8120792 124.49564
load files contains 3mer coordinates info
#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11576 11579 14 14 - GAT
#4 chr1 13315 13318 14 14 - GAT
#5 chr1 19797 19800 14 14 - GAT
#6 chr1 27023 27026 14 14 - GAT
tail(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36 + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36 + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36 + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36 + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36 + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36 + GAT
nrow(all.GAT.file)
#[1] 64243047
bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.
# define function
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.
Find the closest GAT to peak summit regardless of GAT strandedness
all.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(all.distance,file= 'all.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231205_GATA3_ChIP_clsestbed.Rdata')
In this “all.distance’ data frame, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
all.distance=read.table('all.distance.bed',header=F, comment.char='')
head(all.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
coherence check 1: the number of the closest minus GAT is comparable to the number of the closest plus GAT.
nrow(all.distance)
## [1] 7796
nrow(all.distance[all.distance$V9=="+",])
## [1] 3873
nrow(all.distance[all.distance$V9=="-",])
## [1] 3923
coherence check 2: Since we have use -t first to take care of the tie, now we have equal number of unique peak with their closest plus or minus or both GAT.
nrow(chip.peak.summit) # there are 8571 unique peak
## [1] 7796
nrow(all.distance)
## [1] 7796
To answer the first big question, we also need to prepare a control peak set. We want to know the 3mer distribution of GAT/ATC in this random peak set, and compare it with our GATA3 ChIP peaks.
Here we are using the ENCODE union DNase HS sites:
Reference: Amaral, M. L., Erikson, G. A., & Shokhirev, M. N. (2018). BART: bioinformatics array research tool. BMC bioinformatics, 19(1), 1-6.
These union DHS regions represent the whole repertoire of regulatory elements in the human genome. These are regions of chromatin that are sensitive to cleavage by the DNase.
remove GATA3 motifs with MAST
The full .bed file contains 2723010 regulatory regions. To make this file a proper control for our peak set, we use MAST to remove regulatory regions that contains GATA3-like motif 123456789 (use same p-value stringency).
We have 9 motifs that found by MEME/STREME software.
ls GATA3_MEME_STREME_motifs/
#cat GATA3_MEME_STREME_motifs/meme_1.txt
## AGATAAM_streme.txt
## AGATDNHATCT_streme.txt
## AGATNDWNAGATARN_meme.txt_meme_4.txt
## BTTATCWGATB_meme_5.txt_meme.txt
## TGATAA_streme.txt
## WGATBDHRVAGATAA_meme.txt_meme_6.txt
## meme_1.txt
## meme_2.txt
## meme_3.txt
First convert regulatory region file from .bed to .fasta.
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
fastaFromBed -fi $genome -bed hg38_unionDHS_fc4_50merge.bed -fo hg38_unionDHS_fc4_50merge.fasta
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
#round1
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_1.txt hg38_unionDHS_fc4_50merge.fasta > mast_GATA3_PSWM_in_unionDHS_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round1.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round1.bed #34324
intersectBed -v -a hg38_unionDHS_fc4_50merge.bed -b mast_GATA3_PSWM_in_unionDHS_round1.bed > unionDHS_without_motifs_1.bed
wc -l unionDHS_without_motifs_1.bed #2688686
#round2
fastaFromBed -fi $genome -bed unionDHS_without_motifs_1.bed -fo unionDHS_without_motifs_1.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_2.txt unionDHS_without_motifs_1.fasta > mast_GATA3_PSWM_in_unionDHS_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round2.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round2.bed #38803
intersectBed -v -a unionDHS_without_motifs_1.bed -b mast_GATA3_PSWM_in_unionDHS_round2.bed > unionDHS_without_motifs_12.bed
wc -l unionDHS_without_motifs_12.bed #2649883
#round3
fastaFromBed -fi $genome -bed unionDHS_without_motifs_12.bed -fo unionDHS_without_motifs_12.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_3.txt unionDHS_without_motifs_12.fasta > mast_GATA3_PSWM_in_unionDHS_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round3.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round3.bed #35361
intersectBed -v -a unionDHS_without_motifs_12.bed -b mast_GATA3_PSWM_in_unionDHS_round3.bed > unionDHS_without_motifs_123.bed
wc -l unionDHS_without_motifs_123.bed #2614522
#round4
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123.bed -fo unionDHS_without_motifs_123.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/AGATNDWNAGATARN_meme.txt_meme_4.txt unionDHS_without_motifs_123.fasta > mast_GATA3_PSWM_in_unionDHS_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round4.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round4.bed #40994
intersectBed -v -a unionDHS_without_motifs_123.bed -b mast_GATA3_PSWM_in_unionDHS_round4.bed > unionDHS_without_motifs_1234.bed
wc -l unionDHS_without_motifs_1234.bed #2573528
#round5
fastaFromBed -fi $genome -bed unionDHS_without_motifs_1234.bed -fo unionDHS_without_motifs_1234.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/BTTATCWGATB_meme_5.txt_meme.txt unionDHS_without_motifs_1234.fasta > mast_GATA3_PSWM_in_unionDHS_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round5.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round5.bed #29331
intersectBed -v -a unionDHS_without_motifs_1234.bed -b mast_GATA3_PSWM_in_unionDHS_round5.bed > unionDHS_without_motifs_12345.bed
wc -l unionDHS_without_motifs_12345.bed #2544197
#round6
fastaFromBed -fi $genome -bed unionDHS_without_motifs_12345.bed -fo unionDHS_without_motifs_12345.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/WGATBDHRVAGATAA_meme.txt_meme_6.txt unionDHS_without_motifs_12345.fasta > mast_GATA3_PSWM_in_unionDHS_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round6.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round6.bed #42590
intersectBed -v -a unionDHS_without_motifs_12345.bed -b mast_GATA3_PSWM_in_unionDHS_round6.bed > unionDHS_without_motifs_123456.bed
wc -l unionDHS_without_motifs_123456.bed #2501607
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456.bed -fo unionDHS_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATAAM_streme.txt unionDHS_without_motifs_123456.fasta > mast_GATA3_PSWM_in_unionDHS_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round7.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round7.bed #163724
intersectBed -v -a unionDHS_without_motifs_123456.bed -b mast_GATA3_PSWM_in_unionDHS_round7.bed > unionDHS_without_motifs_123456_7.bed
wc -l unionDHS_without_motifs_123456_7.bed #2337883
#round8
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456_7.bed -fo unionDHS_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/TGATAA_streme.txt unionDHS_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_unionDHS_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round8.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round8.bed #101810
intersectBed -v -a unionDHS_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_unionDHS_round8.bed > unionDHS_without_motifs_123456_78.bed
wc -l unionDHS_without_motifs_123456_78.bed #2236073
#round9
fastaFromBed -fi $genome -bed unionDHS_without_motifs_123456_78.bed -fo unionDHS_without_motifs_123456_78.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATDNHATCT_streme.txt unionDHS_without_motifs_123456_78.fasta > mast_GATA3_PSWM_in_unionDHS_round9.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_unionDHS_round9.txt
wc -l mast_GATA3_PSWM_in_unionDHS_round9.bed #61162
intersectBed -v -a unionDHS_without_motifs_123456_78.bed -b mast_GATA3_PSWM_in_unionDHS_round9.bed > unionDHS_without_motifs_123456_789.bed
wc -l unionDHS_without_motifs_123456_789.bed #2174911
Randomly subset the union regulatory region
After the removal of the 9 GATA3 motifs with mast, the unionDHS peak set now remains 2174911 regions (originally 2723010, removed ~20% regions).
Here I am using shuf to randomly select 7800 regulatory region to match with the GATA peak subset (quantile1: 7796 peaks) I am going to compare with.
shuf -n 7800 unionDHS_without_motifs_123456_789.bed > random_7800_unionDHS_without_motifs_123456_789.bed
wc -l random_7800_unionDHS_without_motifs_123456_789.bed
head random_7800_unionDHS_without_motifs_123456_789.bed
## 7800 random_7800_unionDHS_without_motifs_123456_789.bed
## chr1 58307349 58307399 68093 50 +
## chr10 80965538 80965588 1642740 50 +
## chr4 176763006 176763056 819823 50 +
## chr11 72552854 72553010 1763461 156 +
## chr1 68189168 68189373 79518 205 +
## chr5 24465079 24465129 852478 50 +
## chr19 17195481 17195531 2459669 50 +
## chr1 94664602 94664652 104394 50 +
## chr2 56036 56086 234875 50 +
## chr17 42150025 42150076 2322289 51 +
Read the file in and use the center as regulatory region summit:
library(bigWig)
chip.peak=read.table("random_7800_unionDHS_without_motifs_123456_789.bed", header=FALSE)
nrow(chip.peak)
## [1] 7800
head(chip.peak)
## V1 V2 V3 V4 V5 V6
## 1 chr1 58307349 58307399 68093 50 +
## 2 chr10 80965538 80965588 1642740 50 +
## 3 chr4 176763006 176763056 819823 50 +
## 4 chr11 72552854 72553010 1763461 156 +
## 5 chr1 68189168 68189373 79518 205 +
## 6 chr5 24465079 24465129 852478 50 +
chip.peak.summit=center.bed(chip.peak, upstreamWindow=0, downstreamWindow=0)
nrow(chip.peak.summit)
## [1] 7800
head(chip.peak.summit)
## V1 V2 V3 V4 V5 V6
## 1 chr1 58307374 58307375 68093 50 +
## 2 chr10 80965563 80965564 1642740 50 +
## 3 chr4 176763031 176763032 819823 50 +
## 4 chr11 72552932 72552933 1763461 156 +
## 5 chr1 68189270 68189271 79518 205 +
## 6 chr5 24465104 24465105 852478 50 +
find the closest GAT to summit of unionDHS with closestBed.
ctrl.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(ctrl.distance,file= 'ctrl.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231205_GATA3_ChIP_clsestbed2.Rdata')
In this “ctrl.distance’ data frame, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(ctrl.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 190928 190929 chr1 191000 191003 14 14 - GAT 72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14 - GAT 18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14 - GAT 15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36 + GAT 3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36 + GAT 22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36 + GAT 43
nrow(ctrl.distance)
## [1] 7800
nrow(ctrl.distance[ctrl.distance$V9=="+",])
## [1] 3894
nrow(ctrl.distance[ctrl.distance$V9=="-",])
## [1] 3906
all.distance=read.table('all.distance.bed',header=F, comment.char='')
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(all.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
head(ctrl.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 190928 190929 chr1 191000 191003 14 14 - GAT 72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14 - GAT 18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14 - GAT 15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36 + GAT 3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36 + GAT 22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36 + GAT 43
nrow(all.distance)
## [1] 7796
nrow(ctrl.distance)
## [1] 7800
df.chip = cbind(all.distance[,c(1:3, 11)], "GATA3_peak")
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "ctrl")
colnames(df.chip) = c(colnames(all.distance)[1:3], "dis", "status")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.chip)
## V1 V2 V3 dis status
## 1 chr1 5598187 5598188 21 GATA3_peak
## 2 chr1 8017660 8017661 3 GATA3_peak
## 3 chr1 8020178 8020179 6 GATA3_peak
## 4 chr1 8055212 8055213 8 GATA3_peak
## 5 chr1 8061475 8061476 32 GATA3_peak
## 6 chr1 8120791 8120792 4 GATA3_peak
head(df.ctrl)
## V1 V2 V3 dis status
## 1 chr1 190928 190929 72 ctrl
## 2 chr1 1183557 1183558 18 ctrl
## 3 chr1 1207307 1207308 15 ctrl
## 4 chr1 1548961 1548962 3 ctrl
## 5 chr1 1666170 1666171 22 ctrl
## 6 chr1 2298957 2298958 43 ctrl
df.all = rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("GATA3_peak", "ctrl"))
head(df.all)
## V1 V2 V3 dis status
## 1 chr1 5598187 5598188 21 GATA3_peak
## 2 chr1 8017660 8017661 3 GATA3_peak
## 3 chr1 8020178 8020179 6 GATA3_peak
## 4 chr1 8055212 8055213 8 GATA3_peak
## 5 chr1 8061475 8061476 32 GATA3_peak
## 6 chr1 8120791 8120792 4 GATA3_peak
nrow(df.all)
## [1] 15596
plot CDF
We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (ctrl).
We can also determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'ctrl'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'GATA3_peak'])
match.y = seq(0, 7800, by=1) # creating indices
rep.y = seq(0, 7800, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 15;
## [1] 15
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
The above plot plots CDFs’ difference between GATA3 peak set and ctrl peak set. At 15bp, two CDFs start to be parallel/converged.
library(lattice)
library(latticeExtra)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col = c("#ce228e", "grey60"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,3.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col = c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.176, lty =2)
panel.ecdfplot(...)
})
The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (gray, random subset from the union DHS regions).
The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set then the ctrl peak set.
The vertical dashed line indicates at which distance (log10 of 15bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 15 bp of the GATA3 peak summit.
ecdfplot(~dis, groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col = c("#ce228e", "grey60"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = '3mer-GAT Distance (bp) from peak summit',
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,100),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col = c("#ce228e", "grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 15, lty =2)
panel.ecdfplot(...)
})
The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.
Expectation: see spike of the second GAT at fixed position.
Plan:
-Input1: all GAT coordinates .bed file
-Input2: closest GAT (to GATA3 peak summit -without GATA3-like motifs 123456789) coordinates .bed file
-Input3: closest GAT (to unionDHS summit) coordinates .bed file
First, use bedtools subtract to subtract input 1 and 2 (or in ctrl set: subtract input 1 and 3), the output will be GAT coordinates without the 1st closest GAT.
Then,
Use bedtools closestBed to find the second closest GAT to the 1st closest GAT use the output from bedtools subtract.
Plot PDF plot of distances between 1st and 2nd GAT.
Input1
This is the 3mer GAT coordinates on hg38 genome (both plus and minus strand) generated with SeqOutbias use read size==30.
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 13315 13318 14 14 - GAT
#chr1 19797 19800 14 14 - GAT
Input2
This all.distance.bed file is searching for closest 3mer GAT to the peak summit (top 20% intense peak without motif 123456789).
Use awk to select 3mer-GAT’s coordinates from the file.
wc -l all.distance.bed #7796
head -5 all.distance.bed
#chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
#chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
#chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
#chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
#chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
awk '{print $4, $5, $6, $7, $8, $9, $10}' all.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_GATA_peak.bed
wc -l closest_1st_GAT_to_GATA_peak.bed #7796
# remove duplicated lines with `uniq`
awk '{print $4, $5, $6, $7, $8, $9, $10}' all.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_GATA_uniq_peak.bed
wc -l closest_1st_GAT_to_GATA_uniq_peak.bed #7791
head -5 closest_1st_GAT_to_GATA_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak.bed
## chr10 100012503 100012506 14 14 - GAT
## chr10 100043505 100043508 14 14 - GAT
## chr10 100365033 100365036 14 14 - GAT
## chr10 100370706 100370709 14 14 - GAT
## chr10 101295816 101295819 14 14 - GAT
## Positive strand count: 3870
## Positive strand count: 3921
Bedtools subtract
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.
module load bedtools
sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11576 11579 14 14 - GAT
#chr1 12188 12191 36 36 + GAT
sort -k1,1 -k2,2n closest_1st_GAT_to_GATA_uniq_peak.bed > closest_1st_GAT_to_GATA_uniq_peak.sorted.bed
head -5 closest_1st_GAT_to_GATA_uniq_peak.sorted.bed
#chr1 5598164 5598167 36 36 + GAT
#chr1 8017663 8017666 36 36 + GAT
#chr1 8020170 8020173 14 14 - GAT
#chr1 8055202 8055205 36 36 + GAT
#chr1 8061441 8061444 36 36 + GAT
bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_GATA_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed
wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_GATA_uniq_peak.sorted.bed #7791
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed #64235256
#64243047-7791
#[1] 64235256
Input3
This ctrl.distance.bed file is searching for closest 3mer GAT to the unionDHS peak summit (ramdom 7800 regions without motif 123456789).
Use awk to select 3mer-GAT’s coordinates from the file.
wc -l ctrl.distance.bed
head -5 all.distance.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' ctrl.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_unionDHS_peak.bed
wc -l closest_1st_GAT_to_unionDHS_peak.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' ctrl.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_unionDHS_uniq_peak.bed
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.bed
## 7800 ctrl.distance.bed
## chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 7800 closest_1st_GAT_to_unionDHS_peak.bed
## 7800 closest_1st_GAT_to_unionDHS_uniq_peak.bed
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_unionDHS_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_unionDHS_uniq_peak.bed
## chr1 10001686 10001689 36 36 + GAT
## chr1 10039965 10039968 36 36 + GAT
## chr1 100703247 100703250 14 14 - GAT
## chr1 100733243 100733246 36 36 + GAT
## chr1 101907495 101907498 36 36 + GAT
## Positive strand count: 3894
## Positive strand count: 3906
bedtools subtract
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.
module load bedtools
sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11576 11579 14 14 - GAT
#chr1 12188 12191 36 36 + GAT
sort -k1,1 -k2,2n closest_1st_GAT_to_unionDHS_uniq_peak.bed > closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
#chr1 191000 191003 14 14 - GAT
#chr1 1183537 1183540 14 14 - GAT
#chr1 1207290 1207293 14 14 - GAT
#chr1 1548964 1548967 36 36 + GAT
#chr1 1666192 1666195 36 36 + GAT
bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed
wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed #7800
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed #64235247
#64243047-7800
#[1] 64235247
bedtools closestBedGATA3 peak subset
Read the file (1st closest GAT to GATA3 peak subset summit) in.
closest_1st_GAT_to_GATA=read.table("closest_1st_GAT_to_GATA_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_GATA)
## [1] 7791
head(closest_1st_GAT_to_GATA)
## V1 V2 V3 V4 V5 V6 V7
## 1 chr1 5598164 5598167 36 36 + GAT
## 2 chr1 8017663 8017666 36 36 + GAT
## 3 chr1 8020170 8020173 14 14 - GAT
## 4 chr1 8055202 8055205 36 36 + GAT
## 5 chr1 8061441 8061444 36 36 + GAT
## 6 chr1 8120785 8120788 14 14 - GAT
Read the GAT coorrdinates file without the 1st closest GAT to GATA3 peak summit (subset of top20% peaks without GATA3-like motifs 123456789).
all.GAT.without.1st.closest.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.file)
#[1] 64235256
head(all.GAT.without.1st.closest.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11521 11524 36 36 + GAT
#4 chr1 11576 11579 14 14 - GAT
#5 chr1 12188 12191 36 36 + GAT
#6 chr1 13274 13277 36 36 + GAT
tail(all.GAT.without.1st.closest.file)
# V1 V2 V3 V4 V5 V6 V7
#64235251 chrY_KI270740v1_random 33149 33152 14 14 - GAT
#64235252 chrY_KI270740v1_random 35533 35536 36 36 + GAT
#64235253 chrY_KI270740v1_random 35543 35546 36 36 + GAT
#64235254 chrY_KI270740v1_random 35912 35915 36 36 + GAT
#64235255 chrY_KI270740v1_random 36540 36543 36 36 + GAT
#64235256 chrY_KI270740v1_random 36913 36916 36 36 + GAT
Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.
closest.2nd.distance=bedTools.closest(bed1 = closest_1st_GAT_to_GATA[,1:3], bed2 = all.GAT.without.1st.closest.file, opt.string = '-d -t first')
write.table(closest.2nd.distance,file= 'closest.2nd.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231206_GATA3_ChIP_2nd_clsestbed.Rdata')
UnionDHS regions
Read the file (1st closest GAT to unionHDS summit) in:
closest_1st_GAT_to_unionDHS=read.table("closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_unionDHS)
## [1] 7800
head(closest_1st_GAT_to_unionDHS)
## V1 V2 V3 V4 V5 V6 V7
## 1 chr1 191000 191003 14 14 - GAT
## 2 chr1 1183537 1183540 14 14 - GAT
## 3 chr1 1207290 1207293 14 14 - GAT
## 4 chr1 1548964 1548967 36 36 + GAT
## 5 chr1 1666192 1666195 36 36 + GAT
## 6 chr1 2299000 2299003 36 36 + GAT
Read the GAT coorrdinates file without the 1st closest GAT.
all.GAT.without.1st.closest.ctrl.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.ctrl.file)
#[1] 64235247
head(all.GAT.without.1st.closest.ctrl.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11521 11524 36 36 + GAT
#4 chr1 11576 11579 14 14 - GAT
#5 chr1 12188 12191 36 36 + GAT
#6 chr1 13274 13277 36 36 + GAT
Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.
closest.2nd.distance.unionDHS=bedTools.closest(bed1 = closest_1st_GAT_to_unionDHS[,1:3], bed2 = all.GAT.without.1st.closest.ctrl.file, opt.string = '-d -t first')
write.table(closest.2nd.distance.unionDHS, file= 'closest.2nd.distance.unionDHS.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231206_GATA3_ChIP_2nd_clsestbed2.Rdata')
In the previous section, we have found the 2nd closest GAT to the 1st closest GAT to either GATA3peak subset summits or unionDHS summits.
#GATA3 peak subset
closest.2nd.distance=read.table('closest.2nd.distance.bed',header=F, comment.char='')
nrow(closest.2nd.distance) #7791
## [1] 7791
nrow(closest.2nd.distance[closest.2nd.distance$V9=="+",]) #3839
## [1] 3839
nrow(closest.2nd.distance[closest.2nd.distance$V9=="-",]) #3952
## [1] 3952
head(closest.2nd.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598164 5598167 chr1 5598146 5598149 36 36 + GAT 16
## 2 chr1 8017663 8017666 chr1 8017634 8017637 14 14 - GAT 27
## 3 chr1 8020170 8020173 chr1 8020158 8020161 14 14 - GAT 10
## 4 chr1 8055202 8055205 chr1 8055193 8055196 14 14 - GAT 7
## 5 chr1 8061441 8061444 chr1 8061437 8061440 36 36 + GAT 2
## 6 chr1 8120785 8120788 chr1 8120752 8120755 36 36 + GAT 31
#unionDHS
closest.2nd.distance.unionDHS=read.table('closest.2nd.distance.unionDHS.bed',header=F, comment.char='')
nrow(closest.2nd.distance.unionDHS) # 7800
## [1] 7800
nrow(closest.2nd.distance.unionDHS[closest.2nd.distance.unionDHS$V9=="+",]) #3858
## [1] 3858
nrow(closest.2nd.distance.unionDHS[closest.2nd.distance.unionDHS$V9=="-",]) #3942
## [1] 3942
head(closest.2nd.distance.unionDHS)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 191000 191003 chr1 191005 191008 14 14 - GAT 3
## 2 chr1 1183537 1183540 chr1 1183725 1183728 14 14 - GAT 186
## 3 chr1 1207290 1207293 chr1 1207289 1207292 36 36 + GAT 0
## 4 chr1 1548964 1548967 chr1 1549048 1549051 36 36 + GAT 82
## 5 chr1 1666192 1666195 chr1 1666232 1666235 36 36 + GAT 38
## 6 chr1 2299000 2299003 chr1 2299054 2299057 36 36 + GAT 52
Subset two data.frames to contain information of the 2nd closest 3mer-GAT info (chr-start-end-strand-dis);
Then add the ‘status’ info and rbind the two dataframe.
df.chip = cbind(closest.2nd.distance[,c(1:3, 9, 11)], "GATA3_peak")
df.ctrl = cbind(closest.2nd.distance.unionDHS[,c(1:3, 9, 11)], "ctrl")
colnames(df.chip) = c(colnames(closest.2nd.distance)[1:3], "strand", "dis", "status")
colnames(df.ctrl) = c(colnames(closest.2nd.distance.unionDHS)[1:3], "strand", "dis", "status")
df.all = rbind(df.chip, df.ctrl)
head(df.all)
## V1 V2 V3 strand dis status
## 1 chr1 5598164 5598167 + 16 GATA3_peak
## 2 chr1 8017663 8017666 - 27 GATA3_peak
## 3 chr1 8020170 8020173 - 10 GATA3_peak
## 4 chr1 8055202 8055205 - 7 GATA3_peak
## 5 chr1 8061441 8061444 + 2 GATA3_peak
## 6 chr1 8120785 8120788 + 31 GATA3_peak
nrow(df.all)
## [1] 15591
df.all$dis=as.numeric(df.all$dis)
df.all$strand=as.factor(df.all$strand)
df.all$status=as.factor(df.all$status)
head(df.all)
## V1 V2 V3 strand dis status
## 1 chr1 5598164 5598167 + 16 GATA3_peak
## 2 chr1 8017663 8017666 - 27 GATA3_peak
## 3 chr1 8020170 8020173 - 10 GATA3_peak
## 4 chr1 8055202 8055205 - 7 GATA3_peak
## 5 chr1 8061441 8061444 + 2 GATA3_peak
## 6 chr1 8120785 8120788 + 31 GATA3_peak
tail(df.all)
## V1 V2 V3 strand dis status
## 15586 chrY 19024627 19024630 + 3 ctrl
## 15587 chrY 19319677 19319680 - 22 ctrl
## 15588 chrY 21220963 21220966 - 12 ctrl
## 15589 chrY 21370822 21370825 - 82 ctrl
## 15590 chrY 24575686 24575689 - 11 ctrl
## 15591 chrY 26323055 26323058 + 10 ctrl
Plot PDF:
# interaction variable
df.all$interaction_var <- interaction(df.all$strand, df.all$status)
library(lattice)
library(latticeExtra)
myColors <- ifelse(df.all$strand == "+", "red", "blue")
densityplot( ~ abs(dis) | status,
groups = strand,
key = list(text = list(as.character(unique(df.all$strand))),
space="top",
lines=list(col=c("red","blue")),
columns=nlevels(df.all$strand)),
data = df.all,
xlim=c(0,40),
from=0,
to=40,
xlab = "distance (bp) from 2nd GAT to 1st GAT",
type = "count",
col = myColors,
lty = 1,
main = "ctrl vs. GATA3 peaks",
par.settings=list(par.xlab.text=list(cex=1.1,font=2),
par.ylab.text=list(cex=1.1,font=2))
)
In this plot, we use probability density function to plot the relative likelihood at different distances of observing a second closest 3mer GAT to the 1st closest GAT.
The left panel is plotting within the control sets and the right panel is plotting the GATA3 peak subset. In each panel, it shows the distances of 2nd GAT on minus strand (blue) or the distances of the 2nd GAT on plus strand (red) to the 1st closest GAT.
In GATA3 peak subset, most 2nd closest GAT are found at ~9bp distance to the 1st closest GAT. On the other hand, in the control set, we did not see such spike at any fixed position.
This suggest that in GATA ChIP peaks, 3mer GAT-GAT are often found to have fixed spacings between them.
In the above analysis we analyzed peak set that are 1) without MEME/STREME found motifs; and 2) in top 20% peak intensity quantile.
In this analysis, we want to include the other 4 lower quantile peaks that without MEME/STREME found motifs.
load files
load files contains selected ChIP peak summit info
chip.peak.summit.quantile1=read.table("quantile1_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile1)
## [1] 7796
head(chip.peak.summit.quantile1)
## V1 V2 V3 V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476 98.51927
## 6 chr1 8120791 8120792 124.49564
chip.peak.summit.quantile0.8=read.table("quantile0.8_summits.bed", header=FALSE)
chip.peak.summit.quantile0.6=read.table("quantile0.6_summits.bed", header=FALSE)
chip.peak.summit.quantile0.4=read.table("quantile0.4_summits.bed", header=FALSE)
chip.peak.summit.quantile0.2=read.table("quantile0.2_summits.bed", header=FALSE)
nrow(chip.peak.summit.quantile0.8)
## [1] 7796
head(chip.peak.summit.quantile0.8)
## V1 V2 V3 V4
## 1 chr1 1073822 1073823 56.70255
## 2 chr1 5637335 5637336 63.60992
## 3 chr1 7019683 7019684 72.40249
## 4 chr1 7266189 7266190 70.66320
## 5 chr1 7494212 7494213 75.17251
## 6 chr1 8026313 8026314 69.95015
summary(chip.peak.summit.quantile0.8$V4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 52.89 57.45 63.27 64.85 71.43 83.30
nrow(chip.peak.summit.quantile0.6)
## [1] 7796
head(chip.peak.summit.quantile0.6)
## V1 V2 V3 V4
## 1 chr1 827380 827381 45.32469
## 2 chr1 1074184 1074185 48.18883
## 3 chr1 1124966 1124967 45.46610
## 4 chr1 1157747 1157748 44.19288
## 5 chr1 3587455 3587456 51.91936
## 6 chr1 6199528 6199529 42.31217
summary(chip.peak.summit.quantile0.6$V4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.60 43.11 46.04 46.22 49.20 52.89
nrow(chip.peak.summit.quantile0.4)
## [1] 7796
head(chip.peak.summit.quantile0.4)
## V1 V2 V3 V4
## 1 chr1 916769 916770 36.69993
## 2 chr1 1013265 1013266 34.70221
## 3 chr1 1013580 1013581 36.26922
## 4 chr1 1021142 1021143 32.69951
## 5 chr1 1158192 1158193 36.21743
## 6 chr1 1231880 1231881 35.47152
summary(chip.peak.summit.quantile0.4$V4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 31.95 34.14 36.16 36.19 38.22 40.59
nrow(chip.peak.summit.quantile0.2)
## [1] 7594
head(chip.peak.summit.quantile0.2)
## V1 V2 V3 V4
## 1 chr1 924853 924854 30.44684
## 2 chr1 966653 966654 26.87016
## 3 chr1 999508 999509 21.22455
## 4 chr1 1000536 1000537 24.34263
## 5 chr1 1001891 1001892 25.52872
## 6 chr1 1020187 1020188 20.63151
summary(chip.peak.summit.quantile0.2$V4)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6095 24.3426 27.5137 26.4411 29.9572 31.9466
load files contains 3mer coordinates info
#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11576 11579 14 14 - GAT
#4 chr1 13315 13318 14 14 - GAT
#5 chr1 19797 19800 14 14 - GAT
#6 chr1 27023 27026 14 14 - GAT
tail(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36 + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36 + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36 + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36 + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36 + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36 + GAT
nrow(all.GAT.file)
#[1] 64243047
bedtools closestBed
The below function will sort input bed1 and bed2 first, then run bedtools closestBed between bed1 and bed2.
# define function
bedTools.closest <- function(functionstring="/home/FCAM/ssun/packages/bedtools2/bin/closestBed",bed1,bed2,opt.string="") {
options(scipen =99) # not use scientific notation when writing out
#write bed formatted data.frames to tempfile
write.table(bed1,file= 'a.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(bed2,file= 'b.file.bed', quote=F,sep="\t",col.names=F,row.names=F)
# create the command string and call the command using system()
# the command sort a and b file by coordinates
command1=paste('sort -k1,1 -k2,2n', 'a.file.bed', '> a.file.sorted.bed')
cat(command1,"\n") #sort -k1,1 -k2,2n a.file.bed > a.file.sorted.bed
try(system(command1))
command2=paste('sort -k1,1 -k2,2n', 'b.file.bed', '> b.file.sorted.bed')
cat(command2,"\n")
try(system(command2))
# the command call closestBed on bed1 and bed2
command=paste(functionstring, opt.string,"-a",'a.file.sorted.bed',"-b",'b.file.sorted.bed',">",'out.file.bed',sep=" ")
cat(command,"\n")
try(system(command))
res=read.table('out.file.bed',header=F, comment.char='')
# remove intermediate files
command3=paste('rm', 'a.file.bed', 'b.file.bed', 'a.file.sorted.bed', 'b.file.sorted.bed', 'out.file.bed')
cat(command3,"\n")
try(system(command3))
colnames(res) = c(colnames(bed1), colnames(bed2), 'dis' )
return(res)
}
Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.
Find the closest GAT to peak summit regardless of GAT strandedness
all.distance.quantile1=bedTools.closest(bed1 = chip.peak.summit.quantile1[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.8=bedTools.closest(bed1 = chip.peak.summit.quantile0.8[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.6=bedTools.closest(bed1 = chip.peak.summit.quantile0.6[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.4=bedTools.closest(bed1 = chip.peak.summit.quantile0.4[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
all.distance.quantile0.2=bedTools.closest(bed1 = chip.peak.summit.quantile0.2[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(all.distance.quantile1,file= 'all.distance.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.8,file= 'all.distance.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.6,file= 'all.distance.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.4,file= 'all.distance.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(all.distance.quantile0.2,file= 'all.distance.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231211_GATA3_ChIP_clsestbed.Rdata')
In these “all.distance.quantileX’ data frames, V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
all.distance.quantile1=read.table('all.distance.quantile1.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.8=read.table('all.distance.quantile0.8.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.6=read.table('all.distance.quantile0.6.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.4=read.table('all.distance.quantile0.4.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.2=read.table('all.distance.quantile0.2.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
coherence check 1: the number of the closest minus GAT is comparable to the number of the closest plus GAT.
nrow(all.distance.quantile0.8)
## [1] 7796
nrow(all.distance.quantile0.8[all.distance.quantile0.8$V9=="+",])
## [1] 3907
nrow(all.distance.quantile0.8[all.distance.quantile0.8$V9=="-",])
## [1] 3889
nrow(all.distance.quantile0.6)
## [1] 7796
nrow(all.distance.quantile0.6[all.distance.quantile0.6$V9=="+",])
## [1] 3982
nrow(all.distance.quantile0.6[all.distance.quantile0.6$V9=="-",])
## [1] 3814
nrow(all.distance.quantile0.4)
## [1] 7796
nrow(all.distance.quantile0.4[all.distance.quantile0.4$V9=="+",])
## [1] 3983
nrow(all.distance.quantile0.4[all.distance.quantile0.4$V9=="-",])
## [1] 3813
nrow(all.distance.quantile0.2)
## [1] 7594
nrow(all.distance.quantile0.2[all.distance.quantile0.2$V9=="+",])
## [1] 3750
nrow(all.distance.quantile0.2[all.distance.quantile0.2$V9=="-",])
## [1] 3844
coherence check 2: Since we have use -t first to take care of the tie, now we have equal number of unique peak with their closest plus or minus or both GAT.
nrow(chip.peak.summit.quantile0.8) # there are 7796 unique peak
## [1] 7796
nrow(all.distance.quantile0.8)
## [1] 7796
nrow(chip.peak.summit.quantile0.6) # there are 7796 unique peak
## [1] 7796
nrow(all.distance.quantile0.6)
## [1] 7796
nrow(chip.peak.summit.quantile0.4) # there are 7796 unique peak
## [1] 7796
nrow(all.distance.quantile0.4)
## [1] 7796
nrow(chip.peak.summit.quantile0.2) # there are 7594 unique peak
## [1] 7594
nrow(all.distance.quantile0.2)
## [1] 7594
df.chip = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./all.distance.quantile*.bed"))) {
print(chip.peak)
quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'all.distance.')[[1]][2]), ".bed")[[1]][1]
print(quantile.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
df.chip = rbind(df.chip, all.distance)
}
## [1] "./all.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./all.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./all.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./all.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./all.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame': 38778 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
## $ V3 : int 924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
## $ V11 : int 0 65 61 91 46 143 37 29 213 120 ...
## $ quantile.name: chr "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
combine all single quantile plot into one plot
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "ctrl")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.chip)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
head(df.ctrl)
## V1 V2 V3 dis status
## 1 chr1 190928 190929 72 ctrl
## 2 chr1 1183557 1183558 18 ctrl
## 3 chr1 1207307 1207308 15 ctrl
## 4 chr1 1548961 1548962 3 ctrl
## 5 chr1 1666170 1666171 22 ctrl
## 6 chr1 2298957 2298958 43 ctrl
df.all=rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "ctrl"))
head(df.all)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 46578
plot CDF
We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (ctrl).
library(lattice)
library(latticeExtra)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "grey60"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,3.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.176, lty =2)
panel.ecdfplot(...)
})
The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (gray, random subset from the union DHS regions).
The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set then the ctrl peak set.
The vertical dashed line indicates at which distance (log10 of 15bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 15 bp of the GATA3 peak summit.
ecdfplot(~dis, groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = '3mer-GAT Distance (bp) from peak summit',
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,100),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e","grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 15, lty =2)
panel.ecdfplot(...)
})
The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.
Input1
This is the 3mer GAT coordinates on hg38 genome (both plus and minus strand) generated with SeqOutbias use read size==30.
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 13315 13318 14 14 - GAT
#chr1 19797 19800 14 14 - GAT
Input2
This all.distance.quantileX.bed file is searching for closest 3mer GAT to the peak summit (quantile1: top 20% intense peak without motif 123456789; quantile 0.8: second 20%……).
Use awk to select 3mer-GAT’s coordinates from the file.
for i in all.distance.quantile*.bed
do
nm=$(echo $i | awk -F"all.distance." '{print $2}' | awk -F".bed" '{print $1}')
echo $nm
wc -l $i
head -5 $i
awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" > closest_1st_GAT_to_GATA_peak_${nm}.bed
wc -l closest_1st_GAT_to_GATA_peak_${nm}.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" | sort | uniq > closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
wc -l closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
head -5 closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
awk '$6 == "-" {count++} END {print "Negative strand count:", count}' closest_1st_GAT_to_GATA_uniq_peak_${nm}.bed
done 2>&1 | tee -a closest_1st_GAT_to_GATA_sort_uniq_log.txt
cat closest_1st_GAT_to_GATA_sort_uniq_log.txt
## quantile0.2
## 7594 all.distance.quantile0.2.bed
## chr1 924853 924854 chr1 924851 924854 14 14 - GAT 0
## chr1 966653 966654 chr1 966718 966721 36 36 + GAT 65
## chr1 999508 999509 chr1 999569 999572 14 14 - GAT 61
## chr1 1000536 1000537 chr1 1000627 1000630 36 36 + GAT 91
## chr1 1001891 1001892 chr1 1001937 1001940 36 36 + GAT 46
## 7594 closest_1st_GAT_to_GATA_peak_quantile0.2.bed
## 7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
## chr1 1000627 1000630 36 36 + GAT
## chr1 1001937 1001940 36 36 + GAT
## chr1 100351435 100351438 36 36 + GAT
## chr1 10035675 10035678 36 36 + GAT
## chr1 101025693 101025696 14 14 - GAT
## Positive strand count: 3743
## Negative strand count: 3820
## quantile0.4
## 7796 all.distance.quantile0.4.bed
## chr1 916769 916770 chr1 916719 916722 36 36 + GAT 48
## chr1 1013265 1013266 chr1 1013247 1013250 36 36 + GAT 16
## chr1 1013580 1013581 chr1 1013584 1013587 36 36 + GAT 4
## chr1 1021142 1021143 chr1 1021115 1021118 14 14 - GAT 25
## chr1 1158192 1158193 chr1 1158274 1158277 14 14 - GAT 82
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.4.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
## chr1 100249748 100249751 36 36 + GAT
## chr1 101004290 101004293 14 14 - GAT
## chr1 1013247 1013250 36 36 + GAT
## chr1 1013584 1013587 36 36 + GAT
## chr1 10210228 10210231 36 36 + GAT
## Positive strand count: 3979
## Negative strand count: 3802
## quantile0.6
## 7796 all.distance.quantile0.6.bed
## chr1 827380 827381 chr1 827390 827393 14 14 - GAT 10
## chr1 1074184 1074185 chr1 1074157 1074160 14 14 - GAT 25
## chr1 1124966 1124967 chr1 1125026 1125029 36 36 + GAT 60
## chr1 1157747 1157748 chr1 1157663 1157666 14 14 - GAT 82
## chr1 3587455 3587456 chr1 3587417 3587420 14 14 - GAT 36
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.6.bed
## 7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
## chr1 100038217 100038220 14 14 - GAT
## chr1 10033076 10033079 14 14 - GAT
## chr1 101371391 101371394 36 36 + GAT
## chr1 102772386 102772389 36 36 + GAT
## chr1 10599290 10599293 36 36 + GAT
## Positive strand count: 3980
## Negative strand count: 3805
## quantile0.8
## 7796 all.distance.quantile0.8.bed
## chr1 1073822 1073823 chr1 1073808 1073811 36 36 + GAT 12
## chr1 5637335 5637336 chr1 5637333 5637336 36 36 + GAT 0
## chr1 7019683 7019684 chr1 7019698 7019701 14 14 - GAT 15
## chr1 7266189 7266190 chr1 7266144 7266147 36 36 + GAT 43
## chr1 7494212 7494213 chr1 7494203 7494206 14 14 - GAT 7
## 7796 closest_1st_GAT_to_GATA_peak_quantile0.8.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
## chr1 10514718 10514721 36 36 + GAT
## chr1 1073808 1073811 36 36 + GAT
## chr1 107683212 107683215 36 36 + GAT
## chr1 107688147 107688150 36 36 + GAT
## chr1 107688516 107688519 36 36 + GAT
## Positive strand count: 3897
## Negative strand count: 3884
## quantile1
## 7796 all.distance.quantile1.bed
## chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 7796 closest_1st_GAT_to_GATA_peak_quantile1.bed
## 7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
## chr1 100486183 100486186 14 14 - GAT
## chr1 10083257 10083260 36 36 + GAT
## chr1 101797999 101798002 14 14 - GAT
## chr1 102941741 102941744 14 14 - GAT
## chr1 10511140 10511143 36 36 + GAT
## Positive strand count: 3870
## Negative strand count: 3921
Bedtools subtract
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.
module load bedtools
#sort -k1,1 -k2,2n hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11576 11579 14 14 - GAT
#chr1 12188 12191 36 36 + GAT
for i in closest_1st_GAT_to_GATA_uniq_peak_quantile*.bed
do
nm=$(echo $i | awk -F"closest_1st_GAT_to_GATA_uniq_peak_" '{print $2}' | awk -F".bed" '{print $1}')
echo $nm
sort -k1,1 -k2,2n $i > closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed
head -5 closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed
bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_${nm}.bed
wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_GATA_uniq_peak_${nm}.sorted.bed
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_${nm}.bed
done 2>&1 | tee -a closest_1st_GAT_to_GATA_Bedsubtract_log.txt
cat closest_1st_GAT_to_GATA_Bedsubtract_log.txt
## quantile0.2
## chr1 924851 924854 14 14 - GAT
## chr1 966718 966721 36 36 + GAT
## chr1 999569 999572 14 14 - GAT
## chr1 1000627 1000630 36 36 + GAT
## chr1 1001937 1001940 36 36 + GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.sorted.bed
## 64235484 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.2.bed
## quantile0.4
## chr1 916719 916722 36 36 + GAT
## chr1 1013247 1013250 36 36 + GAT
## chr1 1013584 1013587 36 36 + GAT
## chr1 1021115 1021118 14 14 - GAT
## chr1 1158274 1158277 14 14 - GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.sorted.bed
## 64235266 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.4.bed
## quantile0.6
## chr1 827390 827393 14 14 - GAT
## chr1 1074157 1074160 14 14 - GAT
## chr1 1125026 1125029 36 36 + GAT
## chr1 1157663 1157666 14 14 - GAT
## chr1 3587417 3587420 14 14 - GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.sorted.bed
## 64235262 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.6.bed
## quantile0.8
## chr1 1073808 1073811 36 36 + GAT
## chr1 5637333 5637336 36 36 + GAT
## chr1 7019698 7019701 14 14 - GAT
## chr1 7266144 7266147 36 36 + GAT
## chr1 7494203 7494206 14 14 - GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.sorted.bed
## 64235266 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile0.8.bed
## quantile1
## chr1 5598164 5598167 36 36 + GAT
## chr1 8017663 8017666 36 36 + GAT
## chr1 8020170 8020173 14 14 - GAT
## chr1 8055202 8055205 36 36 + GAT
## chr1 8061441 8061444 36 36 + GAT
## 64243047 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
## 7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.sorted.bed
## 64235256 hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_quantile1.bed
Input3
bedtools subtract
wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed #7800
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_unionDHS.bed #64235247
#64243047-7800
#[1] 64235247
bedtools closestBedGATA3 peak subset
Read the file (1st closest GAT to GATA3 peak subset summit) in.
Read the GAT coorrdinates file without the 1st closest GAT to GATA3 peak summit (5 quantile subsets that ranked by intensity of peaks without GATA3-like motifs 123456789).
Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.
#load the bedTools.closest function first
for (closest_1st_GAT in Sys.glob(file.path("./closest_1st_GAT_to_GATA_uniq_peak_quantile*sorted.bed"))) {
print(closest_1st_GAT)
quantile.name =strsplit((strsplit(strsplit(closest_1st_GAT, "/")[[1]][length(strsplit(closest_1st_GAT, "/")[[1]])], 'closest_1st_GAT_to_GATA_uniq_peak_')[[1]][2]), ".sorted.bed")[[1]][1]
print(quantile.name)
closest_1st_GAT_to_GATA=read.table(closest_1st_GAT, header=FALSE)
all.GAT.without.1st.closest.file=read.table(file = paste0("hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_", quantile.name, '.bed'), sep="\t", header=FALSE)
closest.2nd.distance=bedTools.closest(bed1 = closest_1st_GAT_to_GATA[,1:3], bed2 = all.GAT.without.1st.closest.file, opt.string = '-d -t first')
write.table(closest.2nd.distance,file=paste0("closest.2nd.distance.", quantile.name, '.bed'), quote=F,sep="\t",col.names=F,row.names=F)
}
save.image('231211_GATA3_ChIP_2nd_clsestbed.Rdata')
UnionDHS regions
This part has already completed in previous analysis, file is saved in ‘closest.2nd.distance.unionDHS.bed’.
In the previous section, we have found the 2nd closest GAT to the 1st closest GAT to either GATA3peak subset summits or unionDHS summits.
for i in closest.2nd.distance.quantile*.bed
do
echo $i
wc -l $i
head -5 $i
awk '$9 == "+" {count++} END {print "Positive strand count:", count}' $i
awk '$9 == "-" {count++} END {print "Negative strand count:", count}' $i
done 2>&1 | tee -a closest_2nd_distance_log.txt
cat closest_2nd_distance_log.txt
## closest.2nd.distance.quantile0.2.bed
## 7563 closest.2nd.distance.quantile0.2.bed
## chr1 924851 924854 chr1 924776 924779 14 14 - GAT 73
## chr1 966718 966721 chr1 966791 966794 14 14 - GAT 71
## chr1 999569 999572 chr1 999577 999580 36 36 + GAT 6
## chr1 1000627 1000630 chr1 1000628 1000631 14 14 - GAT 0
## chr1 1001937 1001940 chr1 1002046 1002049 14 14 - GAT 107
## Positive strand count: 3743
## Negative strand count: 3820
## closest.2nd.distance.quantile0.4.bed
## 7781 closest.2nd.distance.quantile0.4.bed
## chr1 916719 916722 chr1 916704 916707 36 36 + GAT 13
## chr1 1013247 1013250 chr1 1013234 1013237 14 14 - GAT 11
## chr1 1013584 1013587 chr1 1013519 1013522 14 14 - GAT 63
## chr1 1021115 1021118 chr1 1021098 1021101 14 14 - GAT 15
## chr1 1158274 1158277 chr1 1158525 1158528 36 36 + GAT 249
## Positive strand count: 3917
## Negative strand count: 3864
## closest.2nd.distance.quantile0.6.bed
## 7785 closest.2nd.distance.quantile0.6.bed
## chr1 827390 827393 chr1 827303 827306 14 14 - GAT 85
## chr1 1074157 1074160 chr1 1073954 1073957 36 36 + GAT 201
## chr1 1125026 1125029 chr1 1125064 1125067 36 36 + GAT 36
## chr1 1157663 1157666 chr1 1157653 1157656 14 14 - GAT 8
## chr1 3587417 3587420 chr1 3587362 3587365 36 36 + GAT 53
## Positive strand count: 3833
## Negative strand count: 3952
## closest.2nd.distance.quantile0.8.bed
## 7781 closest.2nd.distance.quantile0.8.bed
## chr1 1073808 1073811 chr1 1073954 1073957 36 36 + GAT 144
## chr1 5637333 5637336 chr1 5637324 5637327 36 36 + GAT 7
## chr1 7019698 7019701 chr1 7019703 7019706 36 36 + GAT 3
## chr1 7266144 7266147 chr1 7266136 7266139 14 14 - GAT 6
## chr1 7494203 7494206 chr1 7494226 7494229 36 36 + GAT 21
## Positive strand count: 3955
## Negative strand count: 3826
## closest.2nd.distance.quantile1.bed
## 7791 closest.2nd.distance.quantile1.bed
## chr1 5598164 5598167 chr1 5598146 5598149 36 36 + GAT 16
## chr1 8017663 8017666 chr1 8017634 8017637 14 14 - GAT 27
## chr1 8020170 8020173 chr1 8020158 8020161 14 14 - GAT 10
## chr1 8055202 8055205 chr1 8055193 8055196 14 14 - GAT 7
## chr1 8061441 8061444 chr1 8061437 8061440 36 36 + GAT 2
## Positive strand count: 3839
## Negative strand count: 3952
Subset two data.frames to contain information of the 2nd closest 3mer-GAT info (chr-start-end-strand-dis);
Then add the ‘status’ info and rbind the two dataframe.
df.chip = data.frame(matrix(nrow = 0, ncol = 6))
for (chip.peak in Sys.glob(file.path("./closest.2nd.distance.quantile*.bed"))) {
print(chip.peak)
quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'closest.2nd.distance.')[[1]][2]), ".bed")[[1]][1]
print(quantile.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(4:6, 9, 11)], quantile.name)
df.chip = rbind(df.chip, all.distance)
}
## [1] "./closest.2nd.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./closest.2nd.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./closest.2nd.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./closest.2nd.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./closest.2nd.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame': 38701 obs. of 6 variables:
## $ V4 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V5 : int 924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
## $ V6 : int 924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
## $ V9 : chr "-" "-" "+" "-" ...
## $ V11 : int 73 71 6 0 107 0 27 193 154 104 ...
## $ quantile.name: chr "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
combine all single quantile plot into one plot
colnames(df.chip) = c("V1","V2","v3", "strand", "dis", "status")
closest.2nd.distance.unionDHS=read.table('closest.2nd.distance.unionDHS.bed',header=F, comment.char='')
df.ctrl = cbind(closest.2nd.distance.unionDHS[,c(4:6, 9, 11)], "ctrl")
colnames(df.ctrl) = c("V1","V2","v3", "strand", "dis", "status")
head(df.chip)
## V1 V2 v3 strand dis status
## 1 chr1 924776 924779 - 73 quantile0.2
## 2 chr1 966791 966794 - 71 quantile0.2
## 3 chr1 999577 999580 + 6 quantile0.2
## 4 chr1 1000628 1000631 - 0 quantile0.2
## 5 chr1 1002046 1002049 - 107 quantile0.2
## 6 chr1 1020331 1020334 - 0 quantile0.2
head(df.ctrl)
## V1 V2 v3 strand dis status
## 1 chr1 191005 191008 - 3 ctrl
## 2 chr1 1183725 1183728 - 186 ctrl
## 3 chr1 1207289 1207292 + 0 ctrl
## 4 chr1 1549048 1549051 + 82 ctrl
## 5 chr1 1666232 1666235 + 38 ctrl
## 6 chr1 2299054 2299057 + 52 ctrl
df.all=rbind(df.chip, df.ctrl)
df.all$status = factor(df.all$status, levels = c("quantile1", "quantile0.8", "quantile0.6", "quantile0.4", "quantile0.2", "ctrl"))
df.all$dis=as.integer(df.all$dis)
df.all$strand=as.factor(df.all$strand)
head(df.all)
## V1 V2 v3 strand dis status
## 1 chr1 924776 924779 - 73 quantile0.2
## 2 chr1 966791 966794 - 71 quantile0.2
## 3 chr1 999577 999580 + 6 quantile0.2
## 4 chr1 1000628 1000631 - 0 quantile0.2
## 5 chr1 1002046 1002049 - 107 quantile0.2
## 6 chr1 1020331 1020334 - 0 quantile0.2
nrow(df.all)
## [1] 46501
Plot PDF:
library(lattice)
library(latticeExtra)
densityplot( ~ abs(dis) | status,
groups = strand,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
data = df.all,
col=c("blue", "red"),
aspect = 2.5,
xlim=c(0,50),
from=0,
to=50,
layout=c(6,1),
xlab = "distance (bp) from 2nd GAT to 1st GAT",
main = "ctrl vs. GATA3 peaks",
#between=list(y=1.0),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("blue", "red"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 9, lty =2)
panel.abline(v= 2, lty =2)
panel.densityplot(...)
})
In this plot, we use probability density function to plot the relative likelihood at different distances of observing a second closest 3mer GAT to the 1st closest GAT.
It is showing a dosage effect when we decreased peak intensity (from quantile 1 to quantile 0.2). The best set is quantile 1.
make a single plot
library(lattice)
library(latticeExtra)
densityplot( ~ abs(dis),
groups = status,
data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
aspect = 1,
xlim=c(0,50),
from=0,
to=50,
#layout=c(1,2),
xlab = "distance (bp) from 2nd GAT to 1st GAT",
main = "ctrl vs. GATA3 peaks",
between=list(y=1.0),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 9, lty =2)
panel.abline(v= 2, lty =2)
panel.densityplot(...)
})
In the above PDF plot, the spike is at 2bp distance and 9bp distance. It is important to know that this is repoting the likelihood, not the actual counts at this two location.
If we count and compare how many entries has 0, 1, 2, 9, 10, 11bp distance, we would see that the counts is apparently higher in 0 and 10bp than in 2 and 9bp.
head closest.2nd.distance.quantile1.bed
wc -l closest.2nd.distance.quantile1.bed
awk '$11 == 0' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 1' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 2' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 9' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 10' closest.2nd.distance.quantile1.bed | wc -l
awk '$11 == 11' closest.2nd.distance.quantile1.bed | wc -l
## chr1 5598164 5598167 chr1 5598146 5598149 36 36 + GAT 16
## chr1 8017663 8017666 chr1 8017634 8017637 14 14 - GAT 27
## chr1 8020170 8020173 chr1 8020158 8020161 14 14 - GAT 10
## chr1 8055202 8055205 chr1 8055193 8055196 14 14 - GAT 7
## chr1 8061441 8061444 chr1 8061437 8061440 36 36 + GAT 2
## chr1 8120785 8120788 chr1 8120752 8120755 36 36 + GAT 31
## chr1 8182615 8182618 chr1 8182616 8182619 14 14 - GAT 0
## chr1 8197832 8197835 chr1 8197815 8197818 14 14 - GAT 15
## chr1 8204648 8204651 chr1 8204655 8204658 36 36 + GAT 5
## chr1 8258971 8258974 chr1 8258981 8258984 14 14 - GAT 8
## 7791 closest.2nd.distance.quantile1.bed
## 1095
## 187
## 137
## 317
## 524
## 446
Probability density plot is reporting the likelihood (y axis) of observing an event (in our case, have a 2nd closest GAT at n position) in a defined range (this is defined by from and to, in the above plot it is 50).
It is important to set from and to arguments inside density(). Because by default, density will extend the range so that the density curve approaches 0 at the extreme.
If I change the defined range from (0,50) to (0,1000), the plot would look different:
densityplot( ~ abs(dis),
groups = status,
data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
aspect = 1,
xlim=c(0,50),
from=0,
to=1000,
#layout=c(1,2),
xlab = "distance (bp) from 2nd GAT to 1st GAT",
main = "ctrl vs. GATA3 peaks",
between=list(y=1.0),
type = "count",
lty = c(1),
lwd=2,
par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 10, lty =2)
panel.abline(v= 2, lty =2)
panel.densityplot(...)
})
Now the highest spike in quantile1 became 10bp distance.
histogram
The density plots are good to visualize the distribution of continuous numeric variables. The densityplot() uses kernel density probability estimate to calculate the density probability of numeric variables.
Since we have discrete distance (in single bp resolution), histogram might be more suitable.
Histograms are constructed by binning the data and counting the number of observations in each bin. We can set the binwidth to be 1bp.
str(df.all)
## 'data.frame': 46501 obs. of 6 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
## $ v3 : int 924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
## $ strand: Factor w/ 2 levels "-","+": 1 1 2 1 1 1 1 2 1 2 ...
## $ dis : int 73 71 6 0 107 0 27 193 154 104 ...
## $ status: Factor w/ 6 levels "quantile1","quantile0.8",..: 5 5 5 5 5 5 5 5 5 5 ...
summary(df.all$dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 20.00 38.63 48.00 24806.00
histogram(~ abs(dis) | status,
groups = strand,
key = list(space = "right", text=list(c("- strand", "+ strand"), col=c("blue","red"), cex=1)),
data = df.all,
col=c("blue", "red"),
aspect = 3,
xlim=c(0,20),
breaks=seq(from=0,to=24806, by=1),
layout=c(6,1),
xlab = "distance (bp) from 2nd GAT to 1st GAT",
main = "ctrl vs. GATA3 peaks",
between=list(y=1.0),
type="density",
lwd=2,
par.settings = list(superpose.line = list(col=c("blue", "red"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 10, lty =2)
panel.abline(v= 1, lty =2)
panel.densityplot(...)
panel.histogram(...)
})
str(df.all)
## 'data.frame': 46501 obs. of 6 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 924776 966791 999577 1000628 1002046 1020331 1069481 1079763 1162642 1163472 ...
## $ v3 : int 924779 966794 999580 1000631 1002049 1020334 1069484 1079766 1162645 1163475 ...
## $ strand: Factor w/ 2 levels "-","+": 1 1 2 1 1 1 1 2 1 2 ...
## $ dis : int 73 71 6 0 107 0 27 193 154 104 ...
## $ status: Factor w/ 6 levels "quantile1","quantile0.8",..: 5 5 5 5 5 5 5 5 5 5 ...
summary(df.all$dis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 7.00 20.00 38.63 48.00 24806.00
histogram(~ abs(dis),
groups = status,
data=df.all,
auto.key = list(space = "right", rectangles=FALSE, lines=TRUE, points=FALSE, cex = 1),
#col=c("#ce228e", colorRampPalette(c("red","pink"))(4), "grey60"),
aspect = 1,
xlim=c(0,30),
breaks=seq(from=0,to=max(df.all$dis), by=1),
xlab = "distance (bp) from 2nd GAT to 1st GAT",
main = "ctrl vs. GATA3 peaks",
between=list(y=1.0),
type="density",
lwd=2,
par.settings = list(superpose.line = list(col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 10, lty =2, col="red")
panel.abline(v= 1, lty =2, col="red")
panel.histogram(...)
panel.densityplot(...) #darg=list(bw = 0.5, kernel="gaussian"),
panel.superpose(..., panel.groups=panel.histogram,
col=c("#ce228e", colorRampPalette(c("red","pink"))(4),"grey60"),alpha=0.4)
})
Conclusion: I think, quantile1 (top 20% intensity) is where we could make the cutoff in the entire GATA3 ChIP peaks. These peaks are: 1) higher intensity; 2) more than 70% peaks has GATA-like motifs found by MEME/STREME; 3) for peaks lack MEME/STREME found motif, we have good enrichment of GAT near peak summit, and a good 3mer-3mer structure.
files to upload:
-3mer-GAT coordinates file (+/-) (in .bigWig format);
-several annotation bed files including: 1) peak summit files; 2) closest GAT to peak summit; 3) 2nd closest GAT to peak summit.
library(bigWig)
peak.region.summit=center.bed(read.table('/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/de_novo_motif/without_motifs_123456_789.bed', sep = "\t", header=FALSE), upstreamWindow = 0, downstreamWindow = 0)
write.table(peak.region.summit,file= 'without_motifs_123456_789_summits.bed', quote=F,sep="\t",col.names=F,row.names=F)
# 1)
cat without_motifs_123456_789_summits.bed | sort -k1,1 -k2,2n > without_motifs_123456_789_summits.sorted.bed
# 2)
cat closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed | sort -k1,1 -k2,2n > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.bed
awk '{print $1, $2, $3, $4, $5, $6}' closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.bed | awk '{$1=$1}1' OFS="\t" > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted2.bed
# 3)
cat closest.2nd.distance.quantile1.bed closest.2nd.distance.quantile0.8.bed closest.2nd.distance.quantile0.2.bed closest.2nd.distance.quantile0.4.bed closest.2nd.distance.quantile0.6.bed | sort -k1,1 -k2,2n > closest.2nd.distance.quantile.all.sorted.bed
awk '{print $4, $5, $6, $7, $8, $9}' closest.2nd.distance.quantile.all.sorted.bed | awk '{$1=$1}1' OFS="\t" > closest.2nd.distance.quantile.all.sorted2.bed
Add trackline:
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"peak.withoutmotifs.summit.bed\" description=\"peak.withoutmotifs.summit_bed\" visibility=full autoScale=on useScore=1 color=0,0,0"
} {print $0}' without_motifs_123456_789_summits.sorted.bed > without_motifs_123456_789_summits.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.1st.GAT_bed\" description=\"closest.1st.GAT_bed\" visibility=full autoScale=on useScore=1 color=255,0,0"
} {print $0}' closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted2.bed > closest_1st_GAT_to_GATA_uniq_peak_quantile_all.sorted.header.bed
awk 'BEGIN {print "browser position chr10:16,000-17,000"
print "track type=bed name=\"closest.2nd.GAT_bed\" description=\"closest.2nd.GAT_bed\" visibility=full autoScale=on useScore=1 color=0,0,255"
} {print $0}' closest.2nd.distance.quantile.all.sorted2.bed > closest.2nd.distance.quantile.all.sorted.header.bed
convert to bigWig
sort the .bed file:
#GAT
dir=/labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/
plus_file=${dir}hg38.3.3.3plus.36_GAT.bed
minus_file=${dir}hg38.3.3.3minus.14_GAT.bed
echo "plus_GAT (.bed)"
wc -l ${plus_file}
head -5 ${plus_file}
echo "minus_GAT (.bed)"
wc -l ${minus_file}
head -5 ${minus_file}
sort -k1,1 -k2,2n ${plus_file} > hg38.3.3.3plus.36_GAT_sorted.bed
sort -k1,1 -k2,2n ${minus_file} > hg38.3.3.3minus.14_GAT_sorted.bed
wc -l hg38.3.3.3plus.36_GAT_sorted.bed
wc -l hg38.3.3.3minus.14_GAT_sorted.bed
Use bedtools genomecov -bg to convert .bed to .bedGraph.
module load bedtools
module load ucsc_genome/2012.05.22
sizes=/home/FCAM/ssun/Genome_pro/hg38.chrom.sizes
#bedtools genomecov tool
bedtools genomecov -bg -i hg38.3.3.3plus.36_GAT_sorted.bed -g ${sizes} > hg38.3.3.3plus.36_GAT.bedGraph
bedtools genomecov -bg -i hg38.3.3.3minus.14_GAT_sorted.bed -g ${sizes} > hg38.3.3.3minus.14_GAT.bedGraph
wc -l hg38.3.3.3plus.36_GAT.bedGraph
wc -l hg38.3.3.3minus.14_GAT.bedGraph
The final step is to convert .bedGraph to .bigWig with UCSC bedGraphToBigWig tool.
#UCSC bedGraphToBigWig tool
bedGraphToBigWig hg38.3.3.3plus.36_GAT.bedGraph ${sizes} hg38.3.3.3plus.36_GAT.bigWig
bedGraphToBigWig hg38.3.3.3minus.14_GAT.bedGraph ${sizes} hg38.3.3.3minus.14_GAT.bigWig
cat UCSCGB.sh_7536738.out
## plus_GAT (.bed)
## 32147038 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/hg38.3.3.3plus.36_GAT.bed
## chr1 11521 11524 36 36 + GAT
## chr1 12188 12191 36 36 + GAT
## chr1 13274 13277 36 36 + GAT
## chr1 13314 13317 36 36 + GAT
## chr1 15171 15174 36 36 + GAT
## minus_GAT (.bed)
## 32096009 /labs/Guertin/siyu/Sathyan_GATA3_ChIP_pool1_pool2/overrep_3mer/hg38_full_kmer3_rs30/seqdump/hg38.3.3.3minus.14_GAT.bed
## chr1 11145 11148 14 14 - GAT
## chr1 11160 11163 14 14 - GAT
## chr1 11576 11579 14 14 - GAT
## chr1 13315 13318 14 14 - GAT
## chr1 19797 19800 14 14 - GAT
## 32147038 hg38.3.3.3plus.36_GAT_sorted.bed
## 32096009 hg38.3.3.3minus.14_GAT_sorted.bed
## 31502719 hg38.3.3.3plus.36_GAT.bedGraph
## 31455328 hg38.3.3.3minus.14_GAT.bedGraph
**Note that the .bedGraph converted from .bed with bedtools genomecov has less genome coordinates entry. This is because the original .bed file contains adjacent regions, and bedtools genomecov collapse these regions into a single entry in the BEDGraph file by merging or summarizing the coverage data.
Trackhub link: http://guertinlab.cam.uchc.edu/over3mer_hub/hub.txt
The annotation BED files need to be upload via Custom Track.
Sved Section:
Figure 2: UCSC Genome browser image
I want to locate regions where the distance between closest 1st and closest 2nd GAT are 0bp, 1bp, 2bp, 9bp and 10bp.
Figure 3: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 0bp
Figure 4: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 1bp
Figure 5: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 2bp
Figure 6: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 9bp
Figure 7: UCSC Genome browser image: the distance between closest 1st and closest 2nd GAT is 10bp
Notice that if closestBed found overlap features, the distance will be reported as 0; if no overlaps are found, closestBed will look for the feature in B that is closest (that is, least genomic distance to the start or end of A) to A.
If B is upstream of A, then it will use the start of A to subtract the end of B, then add 1;
If B is downstream of A, then it will use the start of B to subtract the end of A, then add 1.
Since closestBed is adding 1bp in its calculation of distance, the actual fixed spacings between two 3mer will be distance -1.
Besides the union DHS regions from ENCODE, we also performed CTCF ChIP-seq with MCF7 cells, which could serve as an alternative negative control for this analysis.
Review of CTCF ChIP-seq:
Using MACS3 and two biological replicates of CTCF_CC libraries, we identified 94738 CTCF ChIP peaks.
To make this a proper control for our GATA3 peak sets (that without MEME/STREME motifs), we use MAST to remove CTCF peaks regions that contains GATA3-like motif 123456789 (use same p-value stringency).
We have 9 motifs that found by MEME/STREME software.
ls GATA3_MEME_STREME_motifs/
#cat GATA3_MEME_STREME_motifs/meme_1.txt
## AGATAAM_streme.txt
## AGATDNHATCT_streme.txt
## AGATNDWNAGATARN_meme.txt_meme_4.txt
## BTTATCWGATB_meme_5.txt_meme.txt
## TGATAA_streme.txt
## WGATBDHRVAGATAA_meme.txt_meme_6.txt
## meme_1.txt
## meme_2.txt
## meme_3.txt
First convert CTCF ChIP-peak file (100bp window) from .bed to .fasta.
#cd /home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/overrep_3mer/CTCF_peak_without_123456789
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
dir=/home/FCAM/ssun/GATA3_ChIP_PRO_July2023/ChIP_final/peak_call/CTCF_peak/
fastaFromBed -fi $genome -bed ${dir}CTCF_ChIP_summit_100window.bed -fo CTCF_ChIP_summit_100window.fasta
wc -l ${dir}CTCF_ChIP_summit_100window.bed #94738
MEME (mast uses default p-value: 0.0001)
module load meme/5.4.1
module load R/4.1.2
module load bedtools
genome=/home/FCAM/ssun/Genome/hg38.fa
#round1
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_1.txt CTCF_ChIP_summit_100window.fasta > mast_GATA3_PSWM_in_CTCF_round1.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round1.txt
wc -l mast_GATA3_PSWM_in_CTCF_round1.bed #584
intersectBed -v -a ${dir}CTCF_ChIP_summit_100window.bed -b mast_GATA3_PSWM_in_CTCF_round1.bed > CTCF_without_motifs_1.bed
wc -l CTCF_without_motifs_1.bed #94154
#round2
fastaFromBed -fi $genome -bed CTCF_without_motifs_1.bed -fo CTCF_without_motifs_1.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_2.txt CTCF_without_motifs_1.fasta > mast_GATA3_PSWM_in_CTCF_round2.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round2.txt
wc -l mast_GATA3_PSWM_in_CTCF_round2.bed #675
intersectBed -v -a CTCF_without_motifs_1.bed -b mast_GATA3_PSWM_in_CTCF_round2.bed > CTCF_without_motifs_12.bed
wc -l CTCF_without_motifs_12.bed #93479
#round3
fastaFromBed -fi $genome -bed CTCF_without_motifs_12.bed -fo CTCF_without_motifs_12.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/meme_3.txt CTCF_without_motifs_12.fasta > mast_GATA3_PSWM_in_CTCF_round3.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round3.txt
wc -l mast_GATA3_PSWM_in_CTCF_round3.bed #541
intersectBed -v -a CTCF_without_motifs_12.bed -b mast_GATA3_PSWM_in_CTCF_round3.bed > CTCF_without_motifs_123.bed
wc -l CTCF_without_motifs_123.bed # 92938
#round4
fastaFromBed -fi $genome -bed CTCF_without_motifs_123.bed -fo CTCF_without_motifs_123.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/AGATNDWNAGATARN_meme.txt_meme_4.txt CTCF_without_motifs_123.fasta > mast_GATA3_PSWM_in_CTCF_round4.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round4.txt
wc -l mast_GATA3_PSWM_in_CTCF_round4.bed #614
intersectBed -v -a CTCF_without_motifs_123.bed -b mast_GATA3_PSWM_in_CTCF_round4.bed > CTCF_without_motifs_1234.bed
wc -l CTCF_without_motifs_1234.bed #92324
#round5
fastaFromBed -fi $genome -bed CTCF_without_motifs_1234.bed -fo CTCF_without_motifs_1234.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/BTTATCWGATB_meme_5.txt_meme.txt CTCF_without_motifs_1234.fasta > mast_GATA3_PSWM_in_CTCF_round5.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round5.txt
wc -l mast_GATA3_PSWM_in_CTCF_round5.bed #490
intersectBed -v -a CTCF_without_motifs_1234.bed -b mast_GATA3_PSWM_in_CTCF_round5.bed > CTCF_without_motifs_12345.bed
wc -l CTCF_without_motifs_12345.bed #91834
#round6
fastaFromBed -fi $genome -bed CTCF_without_motifs_12345.bed -fo CTCF_without_motifs_12345.fasta
mast -hit_list -best ../GATA3_MEME_STREME_motifs/WGATBDHRVAGATAA_meme.txt_meme_6.txt CTCF_without_motifs_12345.fasta > mast_GATA3_PSWM_in_CTCF_round6.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round6.txt
wc -l mast_GATA3_PSWM_in_CTCF_round6.bed #653
intersectBed -v -a CTCF_without_motifs_12345.bed -b mast_GATA3_PSWM_in_CTCF_round6.bed > CTCF_without_motifs_123456.bed
wc -l CTCF_without_motifs_123456.bed #91181
STREME (mast uses p-value of 0.0005)
#round7
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456.bed -fo CTCF_without_motifs_123456.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATAAM_streme.txt CTCF_without_motifs_123456.fasta > mast_GATA3_PSWM_in_CTCF_round7.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round7.txt
wc -l mast_GATA3_PSWM_in_CTCF_round7.bed #2692
intersectBed -v -a CTCF_without_motifs_123456.bed -b mast_GATA3_PSWM_in_CTCF_round7.bed > CTCF_without_motifs_123456_7.bed
wc -l CTCF_without_motifs_123456_7.bed #88489
#round8
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456_7.bed -fo CTCF_without_motifs_123456_7.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/TGATAA_streme.txt CTCF_without_motifs_123456_7.fasta > mast_GATA3_PSWM_in_CTCF_round8.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round8.txt
wc -l mast_GATA3_PSWM_in_CTCF_round8.bed #1914
intersectBed -v -a CTCF_without_motifs_123456_7.bed -b mast_GATA3_PSWM_in_CTCF_round8.bed > CTCF_without_motifs_123456_78.bed
wc -l CTCF_without_motifs_123456_78.bed #86575
#round9
fastaFromBed -fi $genome -bed CTCF_without_motifs_123456_78.bed -fo CTCF_without_motifs_123456_78.fasta
mast -mt 0.0005 -hit_list -best ../GATA3_MEME_STREME_motifs/AGATDNHATCT_streme.txt CTCF_without_motifs_123456_78.fasta > mast_GATA3_PSWM_in_CTCF_round9.txt
Rscript /home/FCAM/ssun/scripts/parse_mast_to_coordinates.R mast_GATA3_PSWM_in_CTCF_round9.txt
wc -l mast_GATA3_PSWM_in_CTCF_round9.bed #1840
intersectBed -v -a CTCF_without_motifs_123456_78.bed -b mast_GATA3_PSWM_in_CTCF_round9.bed > CTCF_without_motifs_123456_789.bed
wc -l CTCF_without_motifs_123456_789.bed #84735
Randomly subset the CTCF peaks
After the removal of the 9 GATA3 motifs with mast, the CTCF peak set now remains 84735 regions (originally 94738, removed ~10% regions).
Here I am using shuf to randomly select 7800 CTCF peaks to match with the GATA peak subset (quantile1: 7796 peaks) I am going to compare with.
shuf -n 7800 CTCF_without_motifs_123456_789.bed > random_7800_CTCF_without_motifs_123456_789.bed
wc -l random_7800_CTCF_without_motifs_123456_789.bed
head random_7800_CTCF_without_motifs_123456_789.bed
## 7800 random_7800_CTCF_without_motifs_123456_789.bed
## chr18 48951784 48951885 CTCF_ChIP_peak_36395 18.0917
## chr19 3146694 3146795 CTCF_ChIP_peak_37117 150.667
## chr13 51445062 51445163 CTCF_ChIP_peak_20942 238.503
## chr1 206937764 206937865 CTCF_ChIP_peak_6646 4.83363
## chr16 76854379 76854480 CTCF_ChIP_peak_30360 292.784
## chrX 153887154 153887255 CTCF_ChIP_peak_84880a 4.15912
## chr17 63882339 63882440 CTCF_ChIP_peak_33796b 396.571
## chr14 20460965 20461066 CTCF_ChIP_peak_21906 5.58175
## chr9 19085464 19085565 CTCF_ChIP_peak_78691 134.583
## chr13 50126140 50126241 CTCF_ChIP_peak_20902a 12.1923
Read the file in and use the center as regulatory region summit:
library(bigWig)
chip.peak=read.table("random_7800_CTCF_without_motifs_123456_789.bed", header=FALSE)
nrow(chip.peak)
## [1] 7800
head(chip.peak)
## V1 V2 V3 V4 V5
## 1 chr18 48951784 48951885 CTCF_ChIP_peak_36395 18.09170
## 2 chr19 3146694 3146795 CTCF_ChIP_peak_37117 150.66700
## 3 chr13 51445062 51445163 CTCF_ChIP_peak_20942 238.50300
## 4 chr1 206937764 206937865 CTCF_ChIP_peak_6646 4.83363
## 5 chr16 76854379 76854480 CTCF_ChIP_peak_30360 292.78400
## 6 chrX 153887154 153887255 CTCF_ChIP_peak_84880a 4.15912
chip.peak.summit=center.bed(chip.peak, upstreamWindow=0, downstreamWindow=0)
nrow(chip.peak.summit)
## [1] 7800
head(chip.peak.summit)
## V1 V2 V3 V4 V5
## 1 chr18 48951834 48951835 CTCF_ChIP_peak_36395 18.09170
## 2 chr19 3146744 3146745 CTCF_ChIP_peak_37117 150.66700
## 3 chr13 51445112 51445113 CTCF_ChIP_peak_20942 238.50300
## 4 chr1 206937814 206937815 CTCF_ChIP_peak_6646 4.83363
## 5 chr16 76854429 76854430 CTCF_ChIP_peak_30360 292.78400
## 6 chrX 153887204 153887205 CTCF_ChIP_peak_84880a 4.15912
find the closest GAT to summit of unionDHS with closestBed.
all.GAT.file=read.table(file = "../peak_without_123456789/hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
CTCF.ctrl.distance=bedTools.closest(bed1 = chip.peak.summit[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(CTCF.ctrl.distance,file= 'CTCF.ctrl.distance.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231214_GATA3_ChIP_clsestbed.Rdata')
wc -l CTCF.ctrl.distance.bed
head -5 CTCF.ctrl.distance.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' CTCF.ctrl.distance.bed | awk '{$1=$1}1' OFS="\t"> closest_1st_GAT_to_CTCF_peak.bed
wc -l closest_1st_GAT_to_CTCF_peak.bed
awk '{print $4, $5, $6, $7, $8, $9, $10}' CTCF.ctrl.distance.bed | awk '{$1=$1}1' OFS="\t" | sort | uniq> closest_1st_GAT_to_CTCF_uniq_peak.bed
wc -l closest_1st_GAT_to_CTCF_uniq_peak.bed
## 7800 CTCF.ctrl.distance.bed
## chr1 826944 826945 chr1 826956 826959 36 36 + GAT 12
## chr1 913010 913011 chr1 912992 912995 14 14 - GAT 16
## chr1 1116689 1116690 chr1 1116512 1116515 14 14 - GAT 175
## chr1 1247245 1247246 chr1 1247393 1247396 36 36 + GAT 148
## chr1 1305399 1305400 chr1 1305387 1305390 14 14 - GAT 10
## 7800 closest_1st_GAT_to_CTCF_peak.bed
## 7788 closest_1st_GAT_to_CTCF_uniq_peak.bed
head -5 closest_1st_GAT_to_CTCF_uniq_peak.bed
awk '$6 == "+" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_CTCF_uniq_peak.bed
awk '$6 == "-" {count++} END {print "Positive strand count:", count}' closest_1st_GAT_to_CTCF_uniq_peak.bed
## chr1 10033721 10033724 36 36 + GAT
## chr1 100463958 100463961 14 14 - GAT
## chr1 101357518 101357521 14 14 - GAT
## chr1 101769729 101769732 14 14 - GAT
## chr1 101971482 101971485 14 14 - GAT
## Positive strand count: 4004
## Positive strand count: 3784
bedtools subtract
Subtract the 1st closest GAT from all.GAT, then find the closest 2nd GAT to the closest 1st GAT.
-f Requiring a minimal overlap fraction before subtracting. Here we define -f 1.00 to make sure of a 100% overlap between two file before subtracting.
-s Enforcing same “strandedness” while scanning for features in -b file that should be subtracted from -a file.
module load bedtools
sort -k1,1 -k2,2n ../peak_without_123456789/hg38.3.3.3.30_plus_minus_GAT.bed > hg38.3.3.3.30_plus_minus_GAT.sorted.bed
head -5 hg38.3.3.3.30_plus_minus_GAT.sorted.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11521 11524 36 36 + GAT
#chr1 11576 11579 14 14 - GAT
#chr1 12188 12191 36 36 + GAT
sort -k1,1 -k2,2n closest_1st_GAT_to_CTCF_uniq_peak.bed > closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed
head -5 closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed
#chr1 826956 826959 36 36 + GAT
#chr1 912992 912995 14 14 - GAT
#chr1 1116512 1116515 14 14 - GAT
#chr1 1247393 1247396 36 36 + GAT
#chr1 1305387 1305390 14 14 - GAT
bedtools subtract -a hg38.3.3.3.30_plus_minus_GAT.sorted.bed -b closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed -f 1.00 -s > hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed
wc -l hg38.3.3.3.30_plus_minus_GAT.sorted.bed #64243047
wc -l closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed #7788
wc -l hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed #64235259
#64243047-7788
#[1] 64235259
Read the file (1st closest GAT to CTCF summit) in:
closest_1st_GAT_to_CTCF=read.table("closest_1st_GAT_to_CTCF_uniq_peak.sorted.bed", header=FALSE)
nrow(closest_1st_GAT_to_CTCF)
## [1] 7788
head(closest_1st_GAT_to_CTCF)
## V1 V2 V3 V4 V5 V6 V7
## 1 chr1 826956 826959 36 36 + GAT
## 2 chr1 912992 912995 14 14 - GAT
## 3 chr1 1116512 1116515 14 14 - GAT
## 4 chr1 1247393 1247396 36 36 + GAT
## 5 chr1 1305387 1305390 14 14 - GAT
## 6 chr1 1308225 1308228 14 14 - GAT
Read the GAT coorrdinates file without the 1st closest GAT.
all.GAT.without.1st.closest.ctrl.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT_without_1st_closest_GAT_CTCF.bed", sep="\t", header=FALSE)
nrow(all.GAT.without.1st.closest.ctrl.file)
#64235259
head(all.GAT.without.1st.closest.ctrl.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11521 11524 36 36 + GAT
#4 chr1 11576 11579 14 14 - GAT
#5 chr1 12188 12191 36 36 + GAT
#6 chr1 13274 13277 36 36 + GAT
Use bedtools closestBed to find the 2nd closest GAT to the 1st closest GAT use the output from bedtools subtract.
closest.2nd.distance.CTCF=bedTools.closest(bed1 = closest_1st_GAT_to_CTCF[,1:3], bed2 = all.GAT.without.1st.closest.ctrl.file, opt.string = '-d -t first')
write.table(closest.2nd.distance.CTCF, file= 'closest.2nd.distance.CTCF.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231214_GATA3_ChIP_2nd_clsestbed.Rdata')
We decided to not use the CTCF ChIP peak as the negative control because it will bias the comparison. Since CTCF peaks will be enriched of CTCF motifs and these binding element will have fewer GAT, ATC…
Goals:
-Look for software capable of searching for motifs with variable spacings.
-Determine the criteria for identifying a GATA3 binding element (e.g., GATA3 can bind to 3mer-3mer instances with a non-fixed spacing range from x to y).
-refine some details: 1) Define “spacing” as the relative position of one zinc finger to another. 2) Consider strand orientation: distinguish between cases where 3mer-3mer instances on the same strand, with cases that one 3mer on the positive (+) and the other on the negative (-) strand.
-Consider the creation of communication-oriented figures (e.g., PWMs of known enriched GATA3 motifs, histograms depicting 3mer-3mer distributions representing motif structure).
Plans:
Prioritize finding software capable of searching for motifs with variable spacings.
Determine negative control:
Extend 200bp to peak summit for local control; generate a CDF of the closest GAT distances and a PDF of the distance between the 2nd GAT to the 1st GAT. Compare this local control against the unionDHS control.
Meanwhile, search for union DHS regions in MCF7 cells from ENCODE.
Display 3mer-3mer distributions for positive controls using histograms (use peak sets with enriched motifs that have known 3mer-3mer structures).
Analyze all 64 3mer bigWigs to find the closest 3mer to peak summit/control summit. Draw CDFs and prioritize a subset of 3mers based on proximity to the GATA3 peak summit compared to the control.
For each prioritized 3mer, identify the second closest 3mer by looping through all 64 3mers. Present histogram/PDF of the distances.
Consolidate the PDFs into a single plot, with each trace representing a 3mer-3mer pair showcasing the relative distance between their zinc finger binding sites compare to negative control.
Use this data we could make criteria for the GATA3 binding element: 1) threshold of spacing; and 2) potential 3mer-3mer combinations.
We could also perform the same analysis on spaced 3mer (i.e., GXTA) (seqOutbias).
Goal: add 200bp to the original peak summit, use this new locus as “control summit” to look for 1st and 2nd closest GAT.
peak summit files in quantiles
These are the peak summit files that are ranked from high intensity to low intensity in 5 quantiles.
for i in *quantile*summits.bed
do
wc -l $i
head -5 $i
done
## 7594 quantile0.2_summits.bed
## chr1 924853 924854 30.4468389666785
## chr1 966653 966654 26.8701605998383
## chr1 999508 999509 21.2245532255625
## chr1 1000536 1000537 24.342633538906
## chr1 1001891 1001892 25.5287170527213
## 7796 quantile0.4_summits.bed
## chr1 916769 916770 36.6999257213148
## chr1 1013265 1013266 34.7022145570112
## chr1 1013580 1013581 36.2692161076724
## chr1 1021142 1021143 32.6995098507957
## chr1 1158192 1158193 36.2174294299052
## 7796 quantile0.6_summits.bed
## chr1 827380 827381 45.324686100056
## chr1 1074184 1074185 48.1888265664752
## chr1 1124966 1124967 45.4661019134305
## chr1 1157747 1157748 44.1928777685044
## chr1 3587455 3587456 51.9193632691721
## 7796 quantile0.8_summits.bed
## chr1 1073822 1073823 56.7025480761926
## chr1 5637335 5637336 63.6099238799638
## chr1 7019683 7019684 72.4024868144889
## chr1 7266189 7266190 70.6631991025767
## chr1 7494212 7494213 75.1725145398337
## 7796 quantile1_summits.bed
## chr1 5598187 5598188 271.931218584276
## chr1 8017660 8017661 676.313615413536
## chr1 8020178 8020179 324.955493506146
## chr1 8055212 8055213 111.558738286523
## chr1 8061475 8061476 98.519268615125
These are all 1bp-centered region of GATA3 peaks without the MEMNE/STREME found motifs.
We will add 200bp to the downstream of these peak summit as the “local control”.
library(bigWig)
#quantile1
quantile1_summits.200bpctrl=center.bed(read.table(file="quantile1_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile1_summits.bed",sep="\t", header=FALSE))
## V1 V2 V3 V4
## 1 chr1 5598187 5598188 271.93122
## 2 chr1 8017660 8017661 676.31362
## 3 chr1 8020178 8020179 324.95549
## 4 chr1 8055212 8055213 111.55874
## 5 chr1 8061475 8061476 98.51927
## 6 chr1 8120791 8120792 124.49564
head(quantile1_summits.200bpctrl)
## V1 V2 V3 V4
## 1 chr1 5598387 5598388 271.93122
## 2 chr1 8017860 8017861 676.31362
## 3 chr1 8020378 8020379 324.95549
## 4 chr1 8055412 8055413 111.55874
## 5 chr1 8061675 8061676 98.51927
## 6 chr1 8120991 8120992 124.49564
#write.table(quantile1_summits.200bpctrl, file= 'quantile1_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.8
quantile0.8_summits.200bpctrl=center.bed(read.table(file="quantile0.8_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.8_summits.bed",sep="\t", header=FALSE))
## V1 V2 V3 V4
## 1 chr1 1073822 1073823 56.70255
## 2 chr1 5637335 5637336 63.60992
## 3 chr1 7019683 7019684 72.40249
## 4 chr1 7266189 7266190 70.66320
## 5 chr1 7494212 7494213 75.17251
## 6 chr1 8026313 8026314 69.95015
head(quantile0.8_summits.200bpctrl)
## V1 V2 V3 V4
## 1 chr1 1074022 1074023 56.70255
## 2 chr1 5637535 5637536 63.60992
## 3 chr1 7019883 7019884 72.40249
## 4 chr1 7266389 7266390 70.66320
## 5 chr1 7494412 7494413 75.17251
## 6 chr1 8026513 8026514 69.95015
#write.table(quantile0.8_summits.200bpctrl, file= 'quantile0.8_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.6
quantile0.6_summits.200bpctrl=center.bed(read.table(file="quantile0.6_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.6_summits.bed",sep="\t", header=FALSE))
## V1 V2 V3 V4
## 1 chr1 827380 827381 45.32469
## 2 chr1 1074184 1074185 48.18883
## 3 chr1 1124966 1124967 45.46610
## 4 chr1 1157747 1157748 44.19288
## 5 chr1 3587455 3587456 51.91936
## 6 chr1 6199528 6199529 42.31217
head(quantile0.6_summits.200bpctrl)
## V1 V2 V3 V4
## 1 chr1 827580 827581 45.32469
## 2 chr1 1074384 1074385 48.18883
## 3 chr1 1125166 1125167 45.46610
## 4 chr1 1157947 1157948 44.19288
## 5 chr1 3587655 3587656 51.91936
## 6 chr1 6199728 6199729 42.31217
#write.table(quantile0.6_summits.200bpctrl, file= 'quantile0.6_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.4
quantile0.4_summits.200bpctrl=center.bed(read.table(file="quantile0.4_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.4_summits.bed",sep="\t", header=FALSE))
## V1 V2 V3 V4
## 1 chr1 916769 916770 36.69993
## 2 chr1 1013265 1013266 34.70221
## 3 chr1 1013580 1013581 36.26922
## 4 chr1 1021142 1021143 32.69951
## 5 chr1 1158192 1158193 36.21743
## 6 chr1 1231880 1231881 35.47152
head(quantile0.4_summits.200bpctrl)
## V1 V2 V3 V4
## 1 chr1 916969 916970 36.69993
## 2 chr1 1013465 1013466 34.70221
## 3 chr1 1013780 1013781 36.26922
## 4 chr1 1021342 1021343 32.69951
## 5 chr1 1158392 1158393 36.21743
## 6 chr1 1232080 1232081 35.47152
#write.table(quantile0.4_summits.200bpctrl, file= 'quantile0.4_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
#quantile0.2
quantile0.2_summits.200bpctrl=center.bed(read.table(file="quantile0.2_summits.bed", sep="\t", header=FALSE), upstreamWindow=-200, downstreamWindow=200)
head(read.table(file="quantile0.2_summits.bed",sep="\t", header=FALSE))
## V1 V2 V3 V4
## 1 chr1 924853 924854 30.44684
## 2 chr1 966653 966654 26.87016
## 3 chr1 999508 999509 21.22455
## 4 chr1 1000536 1000537 24.34263
## 5 chr1 1001891 1001892 25.52872
## 6 chr1 1020187 1020188 20.63151
head(quantile0.2_summits.200bpctrl)
## V1 V2 V3 V4
## 1 chr1 925053 925054 30.44684
## 2 chr1 966853 966854 26.87016
## 3 chr1 999708 999709 21.22455
## 4 chr1 1000736 1000737 24.34263
## 5 chr1 1002091 1002092 25.52872
## 6 chr1 1020387 1020388 20.63151
#write.table(quantile0.2_summits.200bpctrl, file= 'quantile0.2_summits.200bpctrl.bed', quote=F,sep="\t",col.names=F,row.names=F)
In this section, we use bedtools closestBed (refer to: https://bedtools.readthedocs.io/en/latest/content/tools/closest.html) to find the closest GAT to each provided peak summit.
Input:
-a is the sorted peak summit file (centered 1bp);
-b is the sorted, and concatenated GAT coordinates file (both plus and minus);
Input1 -a: peak: quantile1_summits.bed : peaks without motif 123456789, top20% intensity
for i in *quantile*summits.200bpctrl.bed
do
wc -l $i
head -5 $i
done
## 7594 quantile0.2_summits.200bpctrl.bed
## chr1 925053 925054 30.4468389666785
## chr1 966853 966854 26.8701605998383
## chr1 999708 999709 21.2245532255625
## chr1 1000736 1000737 24.342633538906
## chr1 1002091 1002092 25.5287170527213
## 7796 quantile0.4_summits.200bpctrl.bed
## chr1 916969 916970 36.6999257213148
## chr1 1013465 1013466 34.7022145570112
## chr1 1013780 1013781 36.2692161076724
## chr1 1021342 1021343 32.6995098507957
## chr1 1158392 1158393 36.2174294299052
## 7796 quantile0.6_summits.200bpctrl.bed
## chr1 827580 827581 45.324686100056
## chr1 1074384 1074385 48.1888265664752
## chr1 1125166 1125167 45.4661019134305
## chr1 1157947 1157948 44.1928777685044
## chr1 3587655 3587656 51.9193632691721
## 7796 quantile0.8_summits.200bpctrl.bed
## chr1 1074022 1074023 56.7025480761926
## chr1 5637535 5637536 63.6099238799638
## chr1 7019883 7019884 72.4024868144889
## chr1 7266389 7266390 70.6631991025767
## chr1 7494412 7494413 75.1725145398337
## 7796 quantile1_summits.200bpctrl.bed
## chr1 5598387 5598388 271.931218584276
## chr1 8017860 8017861 676.313615413536
## chr1 8020378 8020379 324.955493506146
## chr1 8055412 8055413 111.558738286523
## chr1 8061675 8061676 98.519268615125
Input2 -b: concatenated (plus and minus) GAT coordinates on full hg38 use read size==30
wc -l hg38.3.3.3.30_plus_minus_GAT.bed #64243047
head -5 hg38.3.3.3.30_plus_minus_GAT.bed
#chr1 11145 11148 14 14 - GAT
#chr1 11160 11163 14 14 - GAT
#chr1 11576 11579 14 14 - GAT
#chr1 13315 13318 14 14 - GAT
#chr1 19797 19800 14 14 - GAT
load files
load files contains 3mer coordinates info
#concatenate the plus and minus file together
all.GAT.file=read.table(file = "hg38.3.3.3.30_plus_minus_GAT.bed", sep="\t", header=FALSE)
head(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#1 chr1 11145 11148 14 14 - GAT
#2 chr1 11160 11163 14 14 - GAT
#3 chr1 11576 11579 14 14 - GAT
#4 chr1 13315 13318 14 14 - GAT
#5 chr1 19797 19800 14 14 - GAT
#6 chr1 27023 27026 14 14 - GAT
tail(all.GAT.file)
# V1 V2 V3 V4 V5 V6 V7
#64243042 chrY_KI270740v1_random 33028 33031 36 36 + GAT
#64243043 chrY_KI270740v1_random 35533 35536 36 36 + GAT
#64243044 chrY_KI270740v1_random 35543 35546 36 36 + GAT
#64243045 chrY_KI270740v1_random 35912 35915 36 36 + GAT
#64243046 chrY_KI270740v1_random 36540 36543 36 36 + GAT
#64243047 chrY_KI270740v1_random 36913 36916 36 36 + GAT
nrow(all.GAT.file)
#[1] 64243047
bedtools closestBed
Parameter -d will report the distance from the closest GAT to the peak summit.
Parameter -t last or -t first will only report either the last or the first entry in bed2 file if tied distance occurred.
Here we choose to use -t first to report the first coordinates when tie occurred.
Find the closest GAT to peak summit regardless of GAT strandedness
# load closestBed function
localctrl.200.distance.quantile1=bedTools.closest(bed1 = quantile1_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.8=bedTools.closest(bed1 = quantile0.8_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.6=bedTools.closest(bed1 = quantile0.6_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.4=bedTools.closest(bed1 = quantile0.4_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
localctrl.200.distance.quantile0.2=bedTools.closest(bed1 = quantile0.2_summits.200bpctrl[,1:3], bed2 = all.GAT.file, opt.string = '-d -t first')
write.table(localctrl.200.distance.quantile1,file= 'localctrl.200.distance.quantile1.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.8,file= 'localctrl.200.distance.quantile0.8.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.6,file= 'localctrl.200.distance.quantile0.6.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.4,file= 'localctrl.200.distance.quantile0.4.bed', quote=F,sep="\t",col.names=F,row.names=F)
write.table(localctrl.200.distance.quantile0.2,file= 'localctrl.200.distance.quantile0.2.bed', quote=F,sep="\t",col.names=F,row.names=F)
save.image('231218_GATA3_ChIP_clsestbed.Rdata')
negative control 1
In previous analysis, we have used union DHS regions as negative control, and have saved the 1st closest GAT information in file “ctrl.distance.bed”.
Read in the file in R, notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest GAT 3mer to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
head(ctrl.distance)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 190928 190929 chr1 191000 191003 14 14 - GAT 72
## 2 chr1 1183557 1183558 chr1 1183537 1183540 14 14 - GAT 18
## 3 chr1 1207307 1207308 chr1 1207290 1207293 14 14 - GAT 15
## 4 chr1 1548961 1548962 chr1 1548964 1548967 36 36 + GAT 3
## 5 chr1 1666170 1666171 chr1 1666192 1666195 36 36 + GAT 22
## 6 chr1 2298957 2298958 chr1 2299000 2299003 36 36 + GAT 43
nrow(ctrl.distance)
## [1] 7800
nrow(ctrl.distance[ctrl.distance$V9=="+",])
## [1] 3894
nrow(ctrl.distance[ctrl.distance$V9=="-",])
## [1] 3906
negative control2
We have also just created 5 files each with the GATA3 peak summit shifted to downstream 200bp, these file will be used as a local control.
Notice that the first three column is the local control summit (200bp downstream of real peak summit), and column 4 to 10 is the 1st closest GAT coordinates found by closestbed, the last column is the distance.
for i in localctrl.200.distance.quantile*.bed
do
wc -l $i
head -5 $i
done
## 7594 localctrl.200.distance.quantile0.2.bed
## chr1 925053 925054 chr1 925112 925115 36 36 + GAT 59
## chr1 966853 966854 chr1 966791 966794 14 14 - GAT 60
## chr1 999708 999709 chr1 999712 999715 36 36 + GAT 4
## chr1 1000736 1000737 chr1 1000812 1000815 36 36 + GAT 76
## chr1 1002091 1002092 chr1 1002070 1002073 14 14 - GAT 19
## 7796 localctrl.200.distance.quantile0.4.bed
## chr1 916969 916970 chr1 916863 916866 14 14 - GAT 104
## chr1 1013465 1013466 chr1 1013519 1013522 14 14 - GAT 54
## chr1 1013780 1013781 chr1 1013786 1013789 36 36 + GAT 6
## chr1 1021342 1021343 chr1 1021340 1021343 14 14 - GAT 0
## chr1 1158392 1158393 chr1 1158274 1158277 14 14 - GAT 116
## 7796 localctrl.200.distance.quantile0.6.bed
## chr1 827580 827581 chr1 827507 827510 14 14 - GAT 71
## chr1 1074384 1074385 chr1 1074387 1074390 14 14 - GAT 3
## chr1 1125166 1125167 chr1 1125159 1125162 36 36 + GAT 5
## chr1 1157947 1157948 chr1 1157924 1157927 14 14 - GAT 21
## chr1 3587655 3587656 chr1 3587651 3587654 14 14 - GAT 2
## 7796 localctrl.200.distance.quantile0.8.bed
## chr1 1074022 1074023 chr1 1073954 1073957 36 36 + GAT 66
## chr1 5637535 5637536 chr1 5637600 5637603 14 14 - GAT 65
## chr1 7019883 7019884 chr1 7019883 7019886 36 36 + GAT 0
## chr1 7266389 7266390 chr1 7266414 7266417 36 36 + GAT 25
## chr1 7494412 7494413 chr1 7494399 7494402 14 14 - GAT 11
## 7796 localctrl.200.distance.quantile1.bed
## chr1 5598387 5598388 chr1 5598362 5598365 14 14 - GAT 23
## chr1 8017860 8017861 chr1 8017864 8017867 14 14 - GAT 4
## chr1 8020378 8020379 chr1 8020369 8020372 36 36 + GAT 7
## chr1 8055412 8055413 chr1 8055353 8055356 14 14 - GAT 57
## chr1 8061675 8061676 chr1 8061666 8061669 36 36 + GAT 7
Read tne file in R with distance info of local control.
localctrl.200.distance.1=read.table('localctrl.200.distance.quantile1.bed',header=F, comment.char='')
localctrl.200.distance.0.8=read.table('localctrl.200.distance.quantile0.8.bed',header=F, comment.char='')
localctrl.200.distance.0.6=read.table('localctrl.200.distance.quantile0.6.bed',header=F, comment.char='')
localctrl.200.distance.0.4=read.table('localctrl.200.distance.quantile0.4.bed',header=F, comment.char='')
localctrl.200.distance.0.2=read.table('localctrl.200.distance.quantile0.2.bed',header=F, comment.char='')
head(localctrl.200.distance.1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598387 5598388 chr1 5598362 5598365 14 14 - GAT 23
## 2 chr1 8017860 8017861 chr1 8017864 8017867 14 14 - GAT 4
## 3 chr1 8020378 8020379 chr1 8020369 8020372 36 36 + GAT 7
## 4 chr1 8055412 8055413 chr1 8055353 8055356 14 14 - GAT 57
## 5 chr1 8061675 8061676 chr1 8061666 8061669 36 36 + GAT 7
## 6 chr1 8120991 8120992 chr1 8120993 8120996 14 14 - GAT 2
nrow(localctrl.200.distance.1)
## [1] 7796
nrow(localctrl.200.distance.1[localctrl.200.distance.1$V9=="+",])
## [1] 3855
nrow(localctrl.200.distance.1[localctrl.200.distance.1$V9=="-",])
## [1] 3941
GATA3 peaks
We have also generated 5 quantile files with 1st closest GAT to GATA3 peak summits info. These GATA3 peaks are peaks without MEME/MOTIF found motifs and have ranked by peak intensity.
Notice that V1 to V3 are peak summit coordinates, V4 to V9 are closestBed reported closest motif to each peak summit. The last column (V11) is the distance from closest motif to peak summit.
all.distance.quantile1=read.table('all.distance.quantile1.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.8=read.table('all.distance.quantile0.8.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.6=read.table('all.distance.quantile0.6.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.4=read.table('all.distance.quantile0.4.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
all.distance.quantile0.2=read.table('all.distance.quantile0.2.bed',header=F, comment.char='')
head(all.distance.quantile1)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 chr1 5598187 5598188 chr1 5598164 5598167 36 36 + GAT 21
## 2 chr1 8017660 8017661 chr1 8017663 8017666 36 36 + GAT 3
## 3 chr1 8020178 8020179 chr1 8020170 8020173 14 14 - GAT 6
## 4 chr1 8055212 8055213 chr1 8055202 8055205 36 36 + GAT 8
## 5 chr1 8061475 8061476 chr1 8061441 8061444 36 36 + GAT 32
## 6 chr1 8120791 8120792 chr1 8120785 8120788 14 14 - GAT 4
combine data
GATA3 peak info + distance + status.
df.chip = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.chip) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./all.distance.quantile*.bed"))) {
print(chip.peak)
quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'all.distance.')[[1]][2]), ".bed")[[1]][1]
print(quantile.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], quantile.name)
df.chip = rbind(df.chip, all.distance)
}
## [1] "./all.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./all.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./all.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./all.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./all.distance.quantile1.bed"
## [1] "quantile1"
str(df.chip)
## 'data.frame': 38778 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 924853 966653 999508 1000536 1001891 1020187 1069549 1079599 1163013 1163246 ...
## $ V3 : int 924854 966654 999509 1000537 1001892 1020188 1069550 1079600 1163014 1163247 ...
## $ V11 : int 0 65 61 91 46 143 37 29 213 120 ...
## $ quantile.name: chr "quantile0.2" "quantile0.2" "quantile0.2" "quantile0.2" ...
unique(df.chip$quantile.name)
## [1] "quantile0.2" "quantile0.4" "quantile0.6" "quantile0.8" "quantile1"
colnames(df.chip) = c("V1","V2","V3", "dis", "status")
head(df.chip)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
union DHS peak info + distance +status.
ctrl.distance=read.table('ctrl.distance.bed',header=F, comment.char='')
df.ctrl = cbind(ctrl.distance[,c(1:3, 11)], "unionDHS_ctrl")
colnames(df.ctrl) = c(colnames(ctrl.distance)[1:3], "dis", "status")
head(df.ctrl)
## V1 V2 V3 dis status
## 1 chr1 190928 190929 72 unionDHS_ctrl
## 2 chr1 1183557 1183558 18 unionDHS_ctrl
## 3 chr1 1207307 1207308 15 unionDHS_ctrl
## 4 chr1 1548961 1548962 3 unionDHS_ctrl
## 5 chr1 1666170 1666171 22 unionDHS_ctrl
## 6 chr1 2298957 2298958 43 unionDHS_ctrl
local control peak info + distance +status.
df.ctrl2 = data.frame(matrix(nrow = 0, ncol = 5))
colnames(df.ctrl2) = c("V1","V2","v3", "dis", "status")
for (chip.peak in Sys.glob(file.path("./localctrl.200.distance.quantile*.bed"))) {
print(chip.peak)
quantile.name =strsplit((strsplit(strsplit(chip.peak, "/")[[1]][length(strsplit(chip.peak, "/")[[1]])], 'localctrl.200.distance.')[[1]][2]), ".bed")[[1]][1]
print(quantile.name)
all.distance = cbind(read.table(chip.peak,header=F, comment.char='')[,c(1:3, 11)], paste0("localctrl_200_", quantile.name))
df.ctrl2 = rbind(df.ctrl2, all.distance)
}
## [1] "./localctrl.200.distance.quantile0.2.bed"
## [1] "quantile0.2"
## [1] "./localctrl.200.distance.quantile0.4.bed"
## [1] "quantile0.4"
## [1] "./localctrl.200.distance.quantile0.6.bed"
## [1] "quantile0.6"
## [1] "./localctrl.200.distance.quantile0.8.bed"
## [1] "quantile0.8"
## [1] "./localctrl.200.distance.quantile1.bed"
## [1] "quantile1"
str(df.ctrl2)
## 'data.frame': 38778 obs. of 5 variables:
## $ V1 : chr "chr1" "chr1" "chr1" "chr1" ...
## $ V2 : int 925053 966853 999708 1000736 1002091 1020387 1069749 1079799 1163213 1163446 ...
## $ V3 : int 925054 966854 999709 1000737 1002092 1020388 1069750 1079800 1163214 1163447 ...
## $ V11 : int 59 60 4 76 19 54 166 34 153 26 ...
## $ paste0("localctrl_200_", quantile.name): chr "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" "localctrl_200_quantile0.2" ...
colnames(df.ctrl2) = c("V1","V2","V3", "dis", "status")
unique(df.ctrl2$status)
## [1] "localctrl_200_quantile0.2" "localctrl_200_quantile0.4"
## [3] "localctrl_200_quantile0.6" "localctrl_200_quantile0.8"
## [5] "localctrl_200_quantile1"
head(df.ctrl2)
## V1 V2 V3 dis status
## 1 chr1 925053 925054 59 localctrl_200_quantile0.2
## 2 chr1 966853 966854 60 localctrl_200_quantile0.2
## 3 chr1 999708 999709 4 localctrl_200_quantile0.2
## 4 chr1 1000736 1000737 76 localctrl_200_quantile0.2
## 5 chr1 1002091 1002092 19 localctrl_200_quantile0.2
## 6 chr1 1020387 1020388 54 localctrl_200_quantile0.2
merge files.
df.all=rbind(df.chip, df.ctrl, df.ctrl2)
df.all$status = factor(df.all$status, levels = c("quantile0.2", "quantile0.4", "quantile0.6", "quantile0.8", "quantile1", "unionDHS_ctrl", "localctrl_200_quantile0.2", "localctrl_200_quantile0.4", "localctrl_200_quantile0.6", "localctrl_200_quantile0.8", "localctrl_200_quantile1" ))
head(df.all)
## V1 V2 V3 dis status
## 1 chr1 924853 924854 0 quantile0.2
## 2 chr1 966653 966654 65 quantile0.2
## 3 chr1 999508 999509 61 quantile0.2
## 4 chr1 1000536 1000537 91 quantile0.2
## 5 chr1 1001891 1001892 46 quantile0.2
## 6 chr1 1020187 1020188 143 quantile0.2
nrow(df.all)
## [1] 85356
We can empirically determine the distance at which the CDFs’ difference between GATA3 peak set and ctrl peak set reach the plateaus (where CDF traces became parallel or converging).
unique(df.all$status)
## [1] quantile0.2 quantile0.4
## [3] quantile0.6 quantile0.8
## [5] quantile1 unionDHS_ctrl
## [7] localctrl_200_quantile0.2 localctrl_200_quantile0.4
## [9] localctrl_200_quantile0.6 localctrl_200_quantile0.8
## [11] localctrl_200_quantile1
## 11 Levels: quantile0.2 quantile0.4 quantile0.6 quantile0.8 ... localctrl_200_quantile1
Previously, comparing between subset of unionDHS regions and GATA3 peak in quantile1, we determined the converging/parallel distance cutoff at ~13bp.
# Specify categories to compare within df.all
match = ecdf(abs(df.all$dis)[df.all$status == 'unionDHS_ctrl'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 184, by=1) # creating indices
rep.y = seq(0, 184, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1]) # 13;
## [1] 13
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 13
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
Now that we have made local controls, we can determine the cutoff of GATA3 peak in each quantiles compared with its coresponding local ctrl:
quantile1 vs. localctrl_200_quantile1:
match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile1'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile1'])
match.y = seq(0, 184, by=1)
rep.y = seq(0, 184, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 16
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 16
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
quantile0.8 vs. localctrl_200_quantile0.8:
match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.8'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.8'])
match.y = seq(0, 247, by=1)
rep.y = seq(0, 247, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 4
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
quantile0.6 vs. localctrl_200_quantile0.6:
match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.6'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.6'])
match.y = seq(0, 260, by=1)
rep.y = seq(0, 260, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 2
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
quantile0.4 vs. localctrl_200_quantile0.4:
match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.4'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.4'])
match.y = seq(0, 270, by=1)
rep.y = seq(0, 270, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## numeric(0)
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
quantile0.2 vs. localctrl_200_quantile0.2:
match = ecdf(abs(df.all$dis)[df.all$status == 'localctrl_200_quantile0.2'])
rep = ecdf(abs(df.all$dis)[df.all$status == 'quantile0.2'])
match.y = seq(0, 360, by=1)
rep.y = seq(0, 360, by=1)
spl = smooth.spline(rep.y, rep(rep.y) - match(match.y))
pred = predict(spl)
pred1 = predict(spl, deriv=1)
print('The distance that CDFs are parallel or converge:')
## [1] "The distance that CDFs are parallel or converge:"
print(rep.y[min(which(pred1$y<=0)) - 1])
## [1] 20
# Plot graph depicting the distance determination
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,60),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
# this is the same plot as above but with a longer x-axis limit
plot(rep.y, rep(rep.y) - match(match.y),
xlim = c(0,300),
cex=0.7,
xlab = '3mer-GAT Distance from peak summit (bp)',
ylab = 'GATA3 peak CDF - ctrl(unionDHS) CDF')
abline(v = rep.y[min(which(pred1$y<=0)) - 1], col =2, lty =2)
lines(pred[[1]], pred[[2]], col = 'blue')
We are plotting the cumulative distribution function to evaluate whether a 3mer GAT are closer to the summit of GATA3 ChIP peaks (that might use 3mer GAT as a binding sequence) than they are to the summit of a random subset of union DHS regions (unionDHS_ctrl), as well as to the local control (summit shifted to downstream 200bp).
negative control comparison
First, we want to ask if the local control is comparable to the union DHS control?
library(lattice)
library(latticeExtra)
df.nc=rbind(df.ctrl, df.ctrl2)
df.nc$status = factor(df.nc$status, levels = c("unionDHS_ctrl", "localctrl_200_quantile0.2", "localctrl_200_quantile0.4", "localctrl_200_quantile0.6", "localctrl_200_quantile0.8", "localctrl_200_quantile1" ))
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.nc,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("blue", colorRampPalette(c("grey85","grey30"))(5)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85"))
)
The blue trace is the subset of union DHS regions, it mostly overlap with the local control use peak in quantile1.
single plot for each quantile
quantile1 vs. localctrl_200_quantile1 vs. union DHS control:
library(lattice)
library(latticeExtra)
df1=rbind(df.chip[df.chip$status=="quantile1",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile1",])
df1$status = factor(df1$status, levels = c("quantile1", "unionDHS_ctrl", "localctrl_200_quantile1"))
ecdfplot(~abs(dis), groups = status, data = df1,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "lightblue", "grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 13, lty =2, col="lightblue")
panel.abline(v= 16, lty =2, col="grey30")
panel.ecdfplot(...)
})
It seems that unionDHS control is mostly overlap with local control within quantile1.
And GATA3 peak in quantile1 is clearly separated (left shift) from both control.
The light blue vertical line at 13 bp is what we empirically determined as the converging point compared between GATA3 peak in quantile 1 and unionDHS control; the grey vertical line at 16bp is what we empirically determined as the converging point compared between GATA3 peak in quantile 1 and local control.
quantile0.8 vs. localctrl_200_quantile0.8 vs. union DHS control:
library(lattice)
library(latticeExtra)
df0.8=rbind(df.chip[df.chip$status=="quantile0.8",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.8",])
df0.8$status = factor(df0.8$status, levels = c("quantile0.8", "unionDHS_ctrl", "localctrl_200_quantile0.8"))
ecdfplot(~abs(dis), groups = status, data = df0.8,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "lightblue", "grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 4, lty =2, col="grey30")
panel.ecdfplot(...)
})
quantile0.6 vs. localctrl_200_quantile0.6 vs. union DHS control:
library(lattice)
library(latticeExtra)
df0.6=rbind(df.chip[df.chip$status=="quantile0.6",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.6",])
df0.6$status = factor(df0.6$status, levels = c("quantile0.6", "unionDHS_ctrl", "localctrl_200_quantile0.6"))
ecdfplot(~abs(dis), groups = status, data = df0.6,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "lightblue", "grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 2, lty =2, col="grey30")
panel.ecdfplot(...)
})
quantile0.4 vs. localctrl_200_quantile0.4 vs. union DHS control:
library(lattice)
library(latticeExtra)
df0.4=rbind(df.chip[df.chip$status=="quantile0.4",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.4",])
df0.4$status = factor(df0.4$status, levels = c("quantile0.4", "unionDHS_ctrl", "localctrl_200_quantile0.4"))
ecdfplot(~abs(dis), groups = status, data = df0.4,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "lightblue", "grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 0, lty =2, col="grey30")
panel.ecdfplot(...)
})
quantile0.2 vs. localctrl_200_quantile0.2 vs. union DHS control:
library(lattice)
library(latticeExtra)
df0.2=rbind(df.chip[df.chip$status=="quantile0.2",], df.ctrl, df.ctrl2[df.ctrl2$status=="localctrl_200_quantile0.2",])
df0.2$status = factor(df0.2$status, levels = c("quantile0.2", "unionDHS_ctrl", "localctrl_200_quantile0.2"))
ecdfplot(~abs(dis), groups = status, data = df0.2,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c("#ce228e", "lightblue", "grey30"),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c("#ce228e", "lightblue", "grey30"), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 20, lty =2, col="grey30")
panel.ecdfplot(...)
})
GATA3 peaks in quantile 0.8 to 0.2 looks similar with the corresponding local control. It suggests that these peaks has no significant enrichment of 3mer GAT proximal to their summit compared to control.
all in one plot
library(lattice)
library(latticeExtra)
ecdfplot(~log(abs(dis), base = 10), groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('log'[10]~'3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,2.5),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 1.2, lty =2) #log10(16)=1.2
panel.ecdfplot(...)
})
The above cumulative distribution function (PDF) plot is showing the distance between 3mer-GAT to the summit of GATA3 ChIP peak (red), or summit of the ctrl peak set (blue: random subset from the union DHS regions; grey: local control, downstream 200bp of the proginal peak summit).
The left-shift of the red curve suggests that the 3mer-GAT are closer to the GATA3 ChIP peak set in quantile1 than the ctrl peak set.
The vertical dashed line indicates at which distance (log10 of 16bp) the two CDF traces became parallel. This suggests that the 3mer-GAT often binds within 16 bp of the GATA3 peak summit.
ecdfplot(~abs(dis), groups = status, data = df.all,
auto.key = list(space = "right", lines=TRUE, points=FALSE, cex = 1),
col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)),
aspect = 1,
#xlim = c(0, 50000),
scales=list(relation="free",alternating=c(1,1,1,1)),
ylab = 'Cumulative Distribution Function',
xlab = expression('3mer-GAT Distance from peak summit'),
#index.cond = list(c(2,1)),
between=list(y=1.0),
type = 'a',
xlim = c(0,60),
lwd=2,
lty=c(1),
par.settings = list(superpose.line = list(col=c(colorRampPalette(c("pink","red"))(4), "#ce228e", "blue", colorRampPalette(c("grey85","grey30"))(5)), lwd=3), strip.background=list(col="grey85")),
panel = function(...) {
panel.abline(v= 16, lty =2)
panel.ecdfplot(...)
})
The difference between this CDF with the previous one is: The above one is plotting actual distance in bp, the previous one is plotting log10 of the distance. The two CDF plots are essentially communicating the same thing.
In this section, we need to consider two things:
One is to define spacing/distance as the relative position of two zinc finger;
the other is to separate the cases with different strand orientation.
negative control1
In previous analysis, we used union DHS regions as the negative control for GATA3 peaks to compare against.
wc -l closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
head -5 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
## 7800 closest_1st_GAT_to_unionDHS_uniq_peak.sorted.bed
## chr1 191000 191003 14 14 - GAT
## chr1 1183537 1183540 14 14 - GAT
## chr1 1207290 1207293 14 14 - GAT
## chr1 1548964 1548967 36 36 + GAT
## chr1 1666192 1666195 36 36 + GAT
negative control2
We have also generated 5 local controls in intensity-ranked quantiles, each with the GATA3 peak summit shifted to downstream 200bp.
Notice that the first three column is the local control summit (200bp downstream of real peak summit), and column 4 to 10 is the 1st closest GAT coordinates found by closestbed, the last column is the distance.
We can sort and select only the 3mer coordinates info to new file.
for i in localctrl.200.distance.quantile*.bed
do
nm=$(echo $i | awk -F "localctrl.200.distance." '{print $2}' | awk -F".bed" '{print $1}')
echo $nm
awk '{print $4, $5, $6, $7, $8, $9, $10}' $i | awk '{$1=$1}1' OFS="\t" | sort | uniq > closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
wc -l closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
head -5 closest_1st_GAT_to_localctrl_200_uniq_${nm}.sorted.bed
done
## quantile0.2
## 7569 closest_1st_GAT_to_localctrl_200_uniq_quantile0.2.sorted.bed
## chr1 1000812 1000815 36 36 + GAT
## chr1 1002070 1002073 14 14 - GAT
## chr1 100351698 100351701 14 14 - GAT
## chr1 10035708 10035711 36 36 + GAT
## chr1 101026001 101026004 36 36 + GAT
## quantile0.4
## 7791 closest_1st_GAT_to_localctrl_200_uniq_quantile0.4.sorted.bed
## chr1 100249990 100249993 36 36 + GAT
## chr1 101004440 101004443 14 14 - GAT
## chr1 1013519 1013522 14 14 - GAT
## chr1 1013786 1013789 36 36 + GAT
## chr1 10210411 10210414 36 36 + GAT
## quantile0.6
## 7784 closest_1st_GAT_to_localctrl_200_uniq_quantile0.6.sorted.bed
## chr1 100038358 100038361 14 14 - GAT
## chr1 10033297 10033300 14 14 - GAT
## chr1 101371565 101371568 14 14 - GAT
## chr1 102772572 102772575 36 36 + GAT
## chr1 10599439 10599442 14 14 - GAT
## quantile0.8
## 7781 closest_1st_GAT_to_localctrl_200_uniq_quantile0.8.sorted.bed
## chr1 10514921 10514924 36 36 + GAT
## chr1 1073954 1073957 36 36 + GAT
## chr1 107683428 107683431 14 14 - GAT
## chr1 107688373 107688376 14 14 - GAT
## chr1 107688777 107688780 14 14 - GAT
## quantile1
## 7792 closest_1st_GAT_to_localctrl_200_uniq_quantile1.sorted.bed
## chr1 100486336 100486339 14 14 - GAT
## chr1 10083454 10083457 36 36 + GAT
## chr1 101798210 101798213 36 36 + GAT
## chr1 102941895 102941898 14 14 - GAT
## chr1 10511265 10511268 14 14 - GAT
GATA3 peaks
In previous analysis we have called the closest GAT to GATA3 peaks (without MEME/STREME motifs) and saved them in 5 quantile files ranked by their intensity.
for i in closest_1st_GAT_to_GATA_uniq_peak_quantile*.bed
do
wc -l $i
head -5 $i
done
## 7563 closest_1st_GAT_to_GATA_uniq_peak_quantile0.2.bed
## chr1 1000627 1000630 36 36 + GAT
## chr1 1001937 1001940 36 36 + GAT
## chr1 100351435 100351438 36 36 + GAT
## chr1 10035675 10035678 36 36 + GAT
## chr1 101025693 101025696 14 14 - GAT
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.4.bed
## chr1 100249748 100249751 36 36 + GAT
## chr1 101004290 101004293 14 14 - GAT
## chr1 1013247 1013250 36 36 + GAT
## chr1 1013584 1013587 36 36 + GAT
## chr1 10210228 10210231 36 36 + GAT
## 7785 closest_1st_GAT_to_GATA_uniq_peak_quantile0.6.bed
## chr1 100038217 100038220 14 14 - GAT
## chr1 10033076 10033079 14 14 - GAT
## chr1 101371391 101371394 36 36 + GAT
## chr1 102772386 102772389 36 36 + GAT
## chr1 10599290 10599293 36 36 + GAT
## 7781 closest_1st_GAT_to_GATA_uniq_peak_quantile0.8.bed
## chr1 10514718 10514721 36 36 + GAT
## chr1 1073808 1073811 36 36 + GAT
## chr1 107683212 107683215 36 36 + GAT
## chr1 107688147 107688150 36 36 + GAT
## chr1 107688516 107688519 36 36 + GAT
## 7791 closest_1st_GAT_to_GATA_uniq_peak_quantile1.bed
## chr1 100486183 100486186 14 14 - GAT
## chr1 10083257 10083260 36 36 + GAT
## chr1 101797999 101798002 14 14 - GAT
## chr1 102941741 102941744 14 14 - GAT
## chr1 10511140 10511143 36 36 + GAT
Stam_MCF-7_1: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736581
Stam_MCF-7_2: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM736588
GSE29692: “Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, isolating DNaseI ‘double-hit’ fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (36 bp reads). Uniquely mapping high-quality reads were mapped to the genome. DNaseI sensitivity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI sensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.”
Notice that these data are using hg19 genome, we need to use UCSC liftover to convert data to hg38.
There are 9 GATA3 peak sets each with 1 known enriched GATA3-like motif found by MEME/STREME.
sys.glob()